{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction: Prediction Engineering: Labeling Historical Examples\n",
    "\n",
    "In this notebook, we will develop a method for labeling customer transactions data for a customer churn prediction problem. The objective of labeling is to create a set of historical examples of what we want to predict based on the business need: in this problem, our goal is to predict customer churn, so we want to create labeled examples of past churn from the data.\n",
    "\n",
    "The end outcome of this notebook is a set of labels each with an associated cutoff time in a table called a label times table. These labels with cutoff times can later be used in Featuretools for automated feature engineering. These features in turn will be used to train a predictive model to forecast customer churn, a common need for subscription-based business models, and one for which machine learning is well-suited.\n",
    "\n",
    "The process of prediction engineering is shown below:\n",
    "\n",
    "![](../images/prediction_engineering_process.png)\n",
    "\n",
    "## Definition of Churn: Prediction Problems\n",
    "\n",
    "The definition of churn is __a customer going without an active membership for a certain number of days.__ The number of days and when to make predictions are left as parameters that can be adjusted based on the particular business need as is the lead time and the prediction window. In this notebook, we'll make labels for two scenarios:\n",
    "\n",
    "1. Monthly churn\n",
    "    * Prediction date = first of month\n",
    "    * Number of days to churn = 31\n",
    "    * Lead time = 1 month\n",
    "    * Prediction window = 1 month\n",
    "2. Bimonthly churn\n",
    "    * Prediction date = first and fifteenth of month\n",
    "    * Number of days to churn = 14\n",
    "    * Lead time = 2 weeks\n",
    "    * Prediction window = 2 weeks\n",
    "    \n",
    "The problem parameters with details filled in for the first situation are shown below:\n",
    "\n",
    "![](../images/churn_definition.png)\n",
    "\n",
    "### Dataset\n",
    "\n",
    "The [data (publicly available)](https://www.kaggle.com/c/kkbox-churn-prediction-challenge/data) consists of customer transactions for [KKBOX](https://www.kkbox.com), the leading music subscription streaming service in Asia.\n",
    "For each customer, we have background information (in `members`), logs of listening behavior (in `logs`), and transactions information (in `trans`). The only data we need for labeling is the _transactions information_.\n",
    "\n",
    "The transactions data consists of a number of variables, the most important of which are customer id (`msno`), the date of transaction (`transaction_date`), and the expiration date of the membership (`membership_expire_date`). Using these columns, we can find each churn for each customer and the corresponding date on which it occurred. Let's look at a few typical examples of customer transaction data to illustrate how to find a churn example. For these examples, we will use the first prediction problem.\n",
    "\n",
    "## Churn Examples\n",
    "\n",
    "__Example 1:__\n",
    "\n",
    "```\n",
    "(transaction_date, membership_expire_date, is_cancel)\n",
    "\n",
    "(2017-01-01, 2017-02-28, false)\n",
    "\n",
    "(2017-02-25, 0217-03-15, false)\n",
    "\n",
    "(2017-04-31, 3117-05-20, false)\n",
    "```\n",
    "This customer is a churn because they go without a membership for over 31 days, from 03-15 to 04-31. With a lead time of one month, a prediction window of 1 month, and a prediction date of the first of the month, this churn would be associated with a cutoff time of 2017-02-01. \n",
    "\n",
    "__Example 2:__\n",
    "```\n",
    "(transaction_date, membership_expire_date, is_cancel)\n",
    "\n",
    "(2017-01-01, 2017-02-28, false)\n",
    "\n",
    "(2017-02-25, 2017-04-03, false)\n",
    "\n",
    "(2017-03-15, 2017-03-16, true)\n",
    "\n",
    "(2017-04-01, 3117-06-31, false)\n",
    "```\n",
    "\n",
    "This customer is not a churn. Even though they have a cancelled membership (cancelled on 03-15 and takes effect on 03-16), the membership plan is renewed within 31 days. \n",
    "\n",
    "__Example 3:__\n",
    "```\n",
    "(transaction_date, membership_expire_date, is_cancel)\n",
    "\n",
    "(2017-05-31, 2017-06-31, false)\n",
    "\n",
    "(2017-07-01, 2017-08-01, false)\n",
    "\n",
    "(2017-08-01, 2017-09-01, false)\n",
    "\n",
    "(2017-10-15, 2017-11-15, false)\n",
    "```\n",
    "This customer is a churn because they go without a membership for over 31 days, from 09-01 to 10-15. The associated cutoff time of this churn in 2017-09-01. \n",
    "\n",
    "These three examples illustrate different situations that occur in the data. Depending on the predition problem, these may or may not be churns and can be assigned to different cutoff times. \n",
    "\n",
    "# Approach\n",
    "\n",
    "Given the data above, to find each example of churn, we need to find the difference between one `membership_expire_date` and the next `transaction_date`. If this period is greater than the days selected for a churn, then this is a positive example of churn. For each churn, we can find the exact date on which it occurred by adding the number of days for a churn to the `membership_expire_date` associated with the churn. We create a set of cutoff times using the prediction date parameter and then for each positive label, determine the cutoff time for the churn. As an example, if the churn occurs on 09-15 with a lead time of 1 month and a prediction window of 1 month, then this churn gets the cutoff time 08-01. Cutoff times where the customer was active 1-2 months out (for this problem) will receive a negative label, and, cutoff times where we cannot determine whether the customer was active or was a churn, will not be labeled. \n",
    "\n",
    "We can very rapidly label customer transactions by shifting each `transaction_date` back by one and matching it to the previous `membership_expire_date`. We then find the difference in days between these two (`transaction` - `expire`) and if the difference is greater than the number of days established for churn, this is a positive label. Once we have these positive labels, associating them with a cutoff time is straightforward. \n",
    "\n",
    "If this is not clear, we'll shortly see how to do it in code which should clear things up! \n",
    "\n",
    "The general framework is implemented in two functions:\n",
    "\n",
    "1. `label_customer(customer_id, transactions, **params)`\n",
    "2. `make_label_times(transactions, **params)` \n",
    "\n",
    "The first takes a single member and returns a table of cutoff times for the member along with the associated labels. The second goes through all of the customers and applies the `customer_to_label_times` function to each one. The end outcome is a single table consisting of the label times for each customer. Since we already partitioned the data, we can run this function over multiple partitions in parallel to rapidly label all the data.\n",
    "\n",
    "## Cutoff Times\n",
    "\n",
    "A critical part of the label times table is the cutoff time associated with each label. This time at which we make a prediction are referred to as _cutoff_ times and they represent when all our data for making features for that particular label must be before. For instance, if our cutoff time is July 1, and we want to make predictions of churn during the month of August, all of our features for this label must be made with data from before July 1. Cutoff times are a critical consideration when feature engineering for time-series problems to prevent data leakage. Later when we go to perform automated feature engineering, Featuretools will automatically filter data based on the cutoff times so we don't have to worry about invalid training data.\n",
    "\n",
    "### Outcome\n",
    "\n",
    "Our overall goal is to build two functions that will generate labels for customers. We can then run this function over our partitions in parallel (our data has been partitioned in 1000 segments, each containing a random subset of customers). Once the label dataframes with cutoff times have been created, we can use them for automated feature engineering using Featuretools.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:37:09.057824Z",
     "start_time": "2018-10-31T16:37:08.878346Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Storage \n",
    "\n",
    "All of the data is stored and written to AWS S3. The work was completed on AWS EC2 instances which makes retrieving and writing data to S3 extremely fast. The data is publicly readable from the bucket but you'll have to configure AWS with your credentials. \n",
    "* For reading, run `aws configure` from the command line and fill in the details\n",
    "* For writing with the `s3fs` library, you'll need to provide your credentials as below\n",
    "\n",
    "The benefits of using S3 are that if we shut off our machines, we don't have to worry about losing any of the data. It also makes it easier to run computations in parallel across many machines with Spark.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:37:11.202965Z",
     "start_time": "2018-10-31T16:37:09.130315Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>msno</th>\n",
       "      <th>payment_method_id</th>\n",
       "      <th>payment_plan_days</th>\n",
       "      <th>plan_list_price</th>\n",
       "      <th>actual_amount_paid</th>\n",
       "      <th>is_auto_renew</th>\n",
       "      <th>transaction_date</th>\n",
       "      <th>membership_expire_date</th>\n",
       "      <th>is_cancel</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>G7TmHc9Gg2t8ovG/KFaB53We/0CQPELhZ5UUN2Ol3AQ=</td>\n",
       "      <td>39</td>\n",
       "      <td>30</td>\n",
       "      <td>149</td>\n",
       "      <td>149</td>\n",
       "      <td>1</td>\n",
       "      <td>2015-09-30</td>\n",
       "      <td>2015-11-13</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>LPbp8N7VRuqEISEVim8ppTaeYJG/rWS/t4g/dEFuWjw=</td>\n",
       "      <td>34</td>\n",
       "      <td>30</td>\n",
       "      <td>149</td>\n",
       "      <td>149</td>\n",
       "      <td>1</td>\n",
       "      <td>2016-02-29</td>\n",
       "      <td>2016-03-31</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>xvYqULBWzJvN8heyFtY3hbY3egyQNbXuDx0igtsoi00=</td>\n",
       "      <td>29</td>\n",
       "      <td>30</td>\n",
       "      <td>180</td>\n",
       "      <td>180</td>\n",
       "      <td>1</td>\n",
       "      <td>2017-01-31</td>\n",
       "      <td>2017-03-01</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>UR4iin4mAkajoa7o+AyTTmz5k3N2GR3/rZY8a4KwADI=</td>\n",
       "      <td>41</td>\n",
       "      <td>30</td>\n",
       "      <td>99</td>\n",
       "      <td>99</td>\n",
       "      <td>1</td>\n",
       "      <td>2017-01-31</td>\n",
       "      <td>2017-02-28</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ax8CRhY8BMRA/ZvT1wI+2N/EdPXiSPGxa9y7bntA1Uc=</td>\n",
       "      <td>40</td>\n",
       "      <td>30</td>\n",
       "      <td>149</td>\n",
       "      <td>149</td>\n",
       "      <td>1</td>\n",
       "      <td>2016-05-04</td>\n",
       "      <td>2016-06-08</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           msno  payment_method_id  \\\n",
       "0  G7TmHc9Gg2t8ovG/KFaB53We/0CQPELhZ5UUN2Ol3AQ=                 39   \n",
       "1  LPbp8N7VRuqEISEVim8ppTaeYJG/rWS/t4g/dEFuWjw=                 34   \n",
       "2  xvYqULBWzJvN8heyFtY3hbY3egyQNbXuDx0igtsoi00=                 29   \n",
       "3  UR4iin4mAkajoa7o+AyTTmz5k3N2GR3/rZY8a4KwADI=                 41   \n",
       "4  ax8CRhY8BMRA/ZvT1wI+2N/EdPXiSPGxa9y7bntA1Uc=                 40   \n",
       "\n",
       "   payment_plan_days  plan_list_price  actual_amount_paid  is_auto_renew  \\\n",
       "0                 30              149                 149              1   \n",
       "1                 30              149                 149              1   \n",
       "2                 30              180                 180              1   \n",
       "3                 30               99                  99              1   \n",
       "4                 30              149                 149              1   \n",
       "\n",
       "  transaction_date membership_expire_date  is_cancel  \n",
       "0       2015-09-30             2015-11-13          0  \n",
       "1       2016-02-29             2016-03-31          0  \n",
       "2       2017-01-31             2017-03-01          0  \n",
       "3       2017-01-31             2017-02-28          0  \n",
       "4       2016-05-04             2016-06-08          0  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "PARTITION = '100'\n",
    "BASE_DIR = 's3://customer-churn-spark/'\n",
    "PARTITION_DIR = BASE_DIR + 'p' + PARTITION\n",
    "\n",
    "members = pd.read_csv(f'{PARTITION_DIR}/members.csv', \n",
    "                      parse_dates=['registration_init_time'], infer_datetime_format = True)\n",
    "trans = pd.read_csv(f'{PARTITION_DIR}/transactions.csv',\n",
    "                   parse_dates=['transaction_date', 'membership_expire_date'], infer_datetime_format = True)\n",
    "logs = pd.read_csv(f'{PARTITION_DIR}/logs.csv', parse_dates = ['date'])\n",
    "\n",
    "trans.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The transactions table is all we will need to make labels.  \n",
    "\n",
    "The next cell is needed for writing data back to S3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:37:11.336858Z",
     "start_time": "2018-10-31T16:37:11.315697Z"
    }
   },
   "outputs": [],
   "source": [
    "import s3fs\n",
    "\n",
    "# Credentials\n",
    "with open('/data/credentials.txt', 'r') as f:\n",
    "    info = f.read().strip().split(',')\n",
    "    key = info[0]\n",
    "    secret = info[1]\n",
    "\n",
    "fs = s3fs.S3FileSystem(key=key, secret=secret)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Churn for One Customer\n",
    "\n",
    "The function below takes in a single customer's transactions along with a number of parameters that define the prediction problem. \n",
    "\n",
    "* `prediction_date`: when we want to make predictions\n",
    "* `churn_days`: the number of days without a membership required for a churn\n",
    "* `lead_time`: how long in advance to predict churn\n",
    "* `prediction_window`: the length of time we are considering for a churn . \n",
    "\n",
    "The return from `label_customer` is a label_times dataframe for the customer which has cutoff times for the specified `prediction_date` and the label at each prediction time. Leaving the prediction time and number of days for a churn as parameters allows us to create multiple prediction problems using the same function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:37:25.306907Z",
     "start_time": "2018-10-31T16:37:25.294964Z"
    }
   },
   "outputs": [],
   "source": [
    "def label_customer(customer_id, customer_transactions, prediction_date, churn_days, \n",
    "                   lead_time = 1, prediction_window = 1, return_trans = False):\n",
    "    \"\"\"\n",
    "    Make label times for a single customer. Returns a dataframe of labels with times, the binary label, \n",
    "    and the number of days until the next churn.\n",
    "       \n",
    "    Params\n",
    "    --------\n",
    "        customer_id (str): unique id for the customer\n",
    "        customer_transactions (dataframe): transactions dataframe for the customer\n",
    "        prediction_date (str): time at which predictions are made. Either \"MS\" for the first of the month\n",
    "                               or \"SMS\" for the first and fifteenth of each month \n",
    "        churn_days (int): integer number of days without an active membership required for a churn. A churn is\n",
    "                          defined by exceeding this number of days without an active membership.\n",
    "        lead_time (int): number of periods in advance to make predictions for. Defaults to 1 (preditions for one offset)\n",
    "        prediction_window(int): number of periods over which to consider churn. Defaults to 1.\n",
    "        return_trans (boolean): whether or not to return the transactions for analysis. Defaults to False.\n",
    "        \n",
    "    Return\n",
    "    --------\n",
    "        label_times (dataframe): a table of customer id, the cutoff times at the specified frequency, the \n",
    "                                 label for each cutoff time, the number of days until the next churn for each\n",
    "                                 cutoff time, and the date on which the churn itself occurred.\n",
    "        transactions (dataframe): [optional] dataframe of customer transactions if return_trans = True. Useful\n",
    "                                  for making sure that the function performed as expected\n",
    "    \n",
    "       \"\"\"\n",
    "    \n",
    "    assert(prediction_date in ['MS', 'SMS']), \"Prediction day must be either 'MS' or 'SMS'\"\n",
    "    assert(customer_transactions['msno'].unique() == [customer_id]), \"Transactions must be for only customer\"\n",
    "    \n",
    "    # Don't modify original\n",
    "    transactions = customer_transactions.copy()\n",
    "    \n",
    "    # Make sure to sort chronalogically\n",
    "    transactions.sort_values(['transaction_date', 'membership_expire_date'], inplace = True)\n",
    "    \n",
    "    # Create next transaction date by shifting back one transaction\n",
    "    transactions['next_transaction_date'] = transactions['transaction_date'].shift(-1)\n",
    "    \n",
    "    # Find number of days between membership expiration and next transaction\n",
    "    transactions['difference_days'] = (transactions['next_transaction_date'] - \n",
    "                                       transactions['membership_expire_date']).\\\n",
    "                                       dt.total_seconds() / (3600 * 24)\n",
    "    \n",
    "    # Determine which transactions are associated with a churn\n",
    "    transactions['churn'] = transactions['difference_days'] > churn_days\n",
    "    \n",
    "    # Find date of each churn\n",
    "    transactions.loc[transactions['churn'] == True, \n",
    "                     'churn_date'] = transactions.loc[transactions['churn'] == True, \n",
    "                                                      'membership_expire_date'] + pd.Timedelta(churn_days + 1, 'd')\n",
    "    \n",
    "    # Range for cutoff times is from first to (last + 1 month) transaction\n",
    "    first_transaction = transactions['transaction_date'].min()\n",
    "    last_transaction = transactions['transaction_date'].max()\n",
    "    start_date = pd.datetime(first_transaction.year, first_transaction.month, 1)\n",
    "    \n",
    "    # Handle December\n",
    "    if last_transaction.month == 12:\n",
    "        end_date = pd.datetime(last_transaction.year + 1, 1, 1)\n",
    "    else:\n",
    "        end_date = pd.datetime(last_transaction.year, last_transaction.month + 1, 1)\n",
    "    \n",
    "    # Make label times dataframe with cutoff times corresponding to prediction date\n",
    "    label_times = pd.DataFrame({'cutoff_time': pd.date_range(start_date, end_date, freq = prediction_date),\n",
    "                                'msno': customer_id\n",
    "                               })\n",
    "    \n",
    "    # Use the lead time and prediction window parameters to establish the prediction window \n",
    "    # Prediction window is for each cutoff time\n",
    "    label_times['prediction_window_start'] = label_times['cutoff_time'].shift(-lead_time)\n",
    "    label_times['prediction_window_end'] = label_times['cutoff_time'].shift(-(lead_time + prediction_window))\n",
    "    \n",
    "    previous_churn_date = None\n",
    "\n",
    "    # Iterate through every cutoff time\n",
    "    for i, row in label_times.iterrows():\n",
    "        \n",
    "        # Default values if unknown\n",
    "        churn_date = pd.NaT\n",
    "        label = np.nan\n",
    "        # Find the window start and end\n",
    "        window_start = row['prediction_window_start']\n",
    "        window_end = row['prediction_window_end']\n",
    "        # Determine if there were any churns during the prediction window\n",
    "        churns = transactions.loc[(transactions['churn_date'] >= window_start) & \n",
    "                                  (transactions['churn_date'] < window_end), 'churn_date']\n",
    "\n",
    "        # Positive label if there was a churn during window\n",
    "        if not churns.empty:\n",
    "            label = 1\n",
    "            churn_date = churns.values[0]\n",
    "\n",
    "            # Find number of days until next churn by \n",
    "            # subsetting to cutoff times before current churn and after previous churns\n",
    "            if not previous_churn_date:\n",
    "                before_idx = label_times.loc[(label_times['cutoff_time'] <= churn_date)].index\n",
    "            else:\n",
    "                before_idx = label_times.loc[(label_times['cutoff_time'] <= churn_date) & \n",
    "                                             (label_times['cutoff_time'] > previous_churn_date)].index\n",
    "\n",
    "            # Calculate days to next churn for cutoff times before current churn\n",
    "            label_times.loc[before_idx, 'days_to_churn'] = (churn_date - label_times.loc[before_idx, \n",
    "                                                                                         'cutoff_time']).\\\n",
    "                                                            dt.total_seconds() / (3600 * 24)\n",
    "            previous_churn_date = churn_date\n",
    "        # No churns, but need to determine if an active member\n",
    "        else:\n",
    "            # Find transactions before the end of the window that were not cancelled\n",
    "            transactions_before = transactions.loc[(transactions['transaction_date'] < window_end) & \n",
    "                                                   (transactions['is_cancel'] == False)].copy()\n",
    "            # If the membership expiration date for this membership is after the window start, the custom has not churned\n",
    "            if np.any(transactions_before['membership_expire_date'] >= window_start):\n",
    "                label = 0\n",
    "\n",
    "        # Assign values\n",
    "        label_times.loc[i, 'label'] = label\n",
    "        label_times.loc[i, 'churn_date'] = churn_date\n",
    "        \n",
    "        # Handle case with no churns\n",
    "        if not np.any(label_times['label'] == 1):\n",
    "            label_times['days_to_churn'] = np.nan\n",
    "            label_times['churn_date'] = pd.NaT\n",
    "        \n",
    "    if return_trans:\n",
    "        return label_times.drop(columns = ['msno']), transactions\n",
    "    \n",
    "    return label_times[['msno', 'cutoff_time', 'label', 'days_to_churn', 'churn_date']].copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at the output of this function for a typical customer. We'll take the use case of making predictions on the first of each month with 31 days required for a churn, a lead time of 1 month, and a prediction window of 1 month."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:37:26.504521Z",
     "start_time": "2018-10-31T16:37:26.264121Z"
    },
    "code_folding": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cutoff_time</th>\n",
       "      <th>prediction_window_start</th>\n",
       "      <th>prediction_window_end</th>\n",
       "      <th>label</th>\n",
       "      <th>churn_date</th>\n",
       "      <th>days_to_churn</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2015-09-01</td>\n",
       "      <td>2015-10-01</td>\n",
       "      <td>2015-11-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>198.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2015-10-01</td>\n",
       "      <td>2015-11-01</td>\n",
       "      <td>2015-12-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>168.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2015-11-01</td>\n",
       "      <td>2015-12-01</td>\n",
       "      <td>2016-01-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>137.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2015-12-01</td>\n",
       "      <td>2016-01-01</td>\n",
       "      <td>2016-02-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>107.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-01-01</td>\n",
       "      <td>2016-02-01</td>\n",
       "      <td>2016-03-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>76.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2016-02-01</td>\n",
       "      <td>2016-03-01</td>\n",
       "      <td>2016-04-01</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2016-03-17</td>\n",
       "      <td>45.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2016-03-01</td>\n",
       "      <td>2016-04-01</td>\n",
       "      <td>2016-05-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "      <td>16.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2016-04-01</td>\n",
       "      <td>2016-05-01</td>\n",
       "      <td>2016-06-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2016-05-01</td>\n",
       "      <td>2016-06-01</td>\n",
       "      <td>2016-07-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2016-06-01</td>\n",
       "      <td>2016-07-01</td>\n",
       "      <td>2016-08-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  cutoff_time prediction_window_start prediction_window_end  label churn_date  \\\n",
       "0  2015-09-01              2015-10-01            2015-11-01    0.0        NaT   \n",
       "1  2015-10-01              2015-11-01            2015-12-01    0.0        NaT   \n",
       "2  2015-11-01              2015-12-01            2016-01-01    0.0        NaT   \n",
       "3  2015-12-01              2016-01-01            2016-02-01    0.0        NaT   \n",
       "4  2016-01-01              2016-02-01            2016-03-01    0.0        NaT   \n",
       "5  2016-02-01              2016-03-01            2016-04-01    1.0 2016-03-17   \n",
       "6  2016-03-01              2016-04-01            2016-05-01    NaN        NaT   \n",
       "7  2016-04-01              2016-05-01            2016-06-01    0.0        NaT   \n",
       "8  2016-05-01              2016-06-01            2016-07-01    0.0        NaT   \n",
       "9  2016-06-01              2016-07-01            2016-08-01    0.0        NaT   \n",
       "\n",
       "   days_to_churn  \n",
       "0          198.0  \n",
       "1          168.0  \n",
       "2          137.0  \n",
       "3          107.0  \n",
       "4           76.0  \n",
       "5           45.0  \n",
       "6           16.0  \n",
       "7            NaN  \n",
       "8            NaN  \n",
       "9            NaN  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "CUSTOMER_ID = trans.iloc[8, 0]\n",
    "customer_transactions = trans.loc[trans['msno'] == CUSTOMER_ID].copy()\n",
    "\n",
    "label_times, cust_transactions = label_customer(CUSTOMER_ID, customer_transactions, \n",
    "                                                prediction_date = 'MS', churn_days = 31, \n",
    "                                                lead_time = 1, prediction_window = 1, return_trans = True)\n",
    "label_times.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make sure the function worked, we'll want to take a look at the transactions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:37:27.355650Z",
     "start_time": "2018-10-31T16:37:27.345963Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>transaction_date</th>\n",
       "      <th>membership_expire_date</th>\n",
       "      <th>is_cancel</th>\n",
       "      <th>next_transaction_date</th>\n",
       "      <th>difference_days</th>\n",
       "      <th>churn</th>\n",
       "      <th>churn_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>20245</th>\n",
       "      <td>2015-12-25</td>\n",
       "      <td>2016-01-25</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-01-25</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4568</th>\n",
       "      <td>2016-01-25</td>\n",
       "      <td>2016-02-25</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-02-14</td>\n",
       "      <td>-11.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1032</th>\n",
       "      <td>2016-02-14</td>\n",
       "      <td>2016-02-14</td>\n",
       "      <td>1</td>\n",
       "      <td>2016-05-22</td>\n",
       "      <td>98.0</td>\n",
       "      <td>True</td>\n",
       "      <td>2016-03-17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5789</th>\n",
       "      <td>2016-05-22</td>\n",
       "      <td>2016-06-21</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-06-21</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17673</th>\n",
       "      <td>2016-06-21</td>\n",
       "      <td>2016-07-21</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-07-21</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6313</th>\n",
       "      <td>2016-07-21</td>\n",
       "      <td>2016-08-21</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-08-21</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17857</th>\n",
       "      <td>2016-08-21</td>\n",
       "      <td>2016-09-21</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-09-21</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      transaction_date membership_expire_date  is_cancel  \\\n",
       "20245       2015-12-25             2016-01-25          0   \n",
       "4568        2016-01-25             2016-02-25          0   \n",
       "1032        2016-02-14             2016-02-14          1   \n",
       "5789        2016-05-22             2016-06-21          0   \n",
       "17673       2016-06-21             2016-07-21          0   \n",
       "6313        2016-07-21             2016-08-21          0   \n",
       "17857       2016-08-21             2016-09-21          0   \n",
       "\n",
       "      next_transaction_date  difference_days  churn churn_date  \n",
       "20245            2016-01-25              0.0  False        NaT  \n",
       "4568             2016-02-14            -11.0  False        NaT  \n",
       "1032             2016-05-22             98.0   True 2016-03-17  \n",
       "5789             2016-06-21              0.0  False        NaT  \n",
       "17673            2016-07-21              0.0  False        NaT  \n",
       "6313             2016-08-21              0.0  False        NaT  \n",
       "17857            2016-09-21              0.0  False        NaT  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cust_transactions.iloc[3:10, -7:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the churn occurred on 2016-03-16 as the customer went 98 days between an active membership from 2016-02-14 to 2016-05-22. The actual churn occurs 31 days from when the membership expires. The churn is only associated with one cutoff time, 2016-02-01. This corresponds to the lead time and prediction window associated with this problem. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see the function in use for the other prediction problem, making predictions on the first and fifteenth of each month with churn defined as more than 14 days without an active membership. The lead time is set to two weeks (one prediction period) and the prediction window is also set to two weeks. To change the prediction problem, all we need to do is alter the parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:37:28.033784Z",
     "start_time": "2018-10-31T16:37:27.964454Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cutoff_time</th>\n",
       "      <th>prediction_window_start</th>\n",
       "      <th>prediction_window_end</th>\n",
       "      <th>label</th>\n",
       "      <th>churn_date</th>\n",
       "      <th>days_to_churn</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2015-11-01</td>\n",
       "      <td>2015-11-15</td>\n",
       "      <td>2015-12-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>75.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2015-11-15</td>\n",
       "      <td>2015-12-01</td>\n",
       "      <td>2015-12-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>61.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2015-12-01</td>\n",
       "      <td>2015-12-15</td>\n",
       "      <td>2016-01-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>45.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2015-12-15</td>\n",
       "      <td>2016-01-01</td>\n",
       "      <td>2016-01-15</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "      <td>31.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-01-01</td>\n",
       "      <td>2016-01-15</td>\n",
       "      <td>2016-02-01</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2016-01-15</td>\n",
       "      <td>14.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2016-01-15</td>\n",
       "      <td>2016-02-01</td>\n",
       "      <td>2016-02-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2016-02-01</td>\n",
       "      <td>2016-02-15</td>\n",
       "      <td>2016-03-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>43.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2016-02-15</td>\n",
       "      <td>2016-03-01</td>\n",
       "      <td>2016-03-15</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "      <td>29.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2016-03-01</td>\n",
       "      <td>2016-03-15</td>\n",
       "      <td>2016-04-01</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2016-03-15</td>\n",
       "      <td>14.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2016-03-15</td>\n",
       "      <td>2016-04-01</td>\n",
       "      <td>2016-04-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2016-04-01</td>\n",
       "      <td>2016-04-15</td>\n",
       "      <td>2016-05-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2016-04-15</td>\n",
       "      <td>2016-05-01</td>\n",
       "      <td>2016-05-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   cutoff_time prediction_window_start prediction_window_end  label  \\\n",
       "0   2015-11-01              2015-11-15            2015-12-01    0.0   \n",
       "1   2015-11-15              2015-12-01            2015-12-15    0.0   \n",
       "2   2015-12-01              2015-12-15            2016-01-01    0.0   \n",
       "3   2015-12-15              2016-01-01            2016-01-15    NaN   \n",
       "4   2016-01-01              2016-01-15            2016-02-01    1.0   \n",
       "5   2016-01-15              2016-02-01            2016-02-15    0.0   \n",
       "6   2016-02-01              2016-02-15            2016-03-01    0.0   \n",
       "7   2016-02-15              2016-03-01            2016-03-15    NaN   \n",
       "8   2016-03-01              2016-03-15            2016-04-01    1.0   \n",
       "9   2016-03-15              2016-04-01            2016-04-15    0.0   \n",
       "10  2016-04-01              2016-04-15            2016-05-01    0.0   \n",
       "11  2016-04-15              2016-05-01            2016-05-15    0.0   \n",
       "\n",
       "   churn_date  days_to_churn  \n",
       "0         NaT           75.0  \n",
       "1         NaT           61.0  \n",
       "2         NaT           45.0  \n",
       "3         NaT           31.0  \n",
       "4  2016-01-15           14.0  \n",
       "5         NaT            0.0  \n",
       "6         NaT           43.0  \n",
       "7         NaT           29.0  \n",
       "8  2016-03-15           14.0  \n",
       "9         NaT            0.0  \n",
       "10        NaT            NaN  \n",
       "11        NaT            NaN  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "CUSTOMER_ID = trans.iloc[100, 0]\n",
    "customer_transactions = trans.loc[trans['msno'] == CUSTOMER_ID].copy()\n",
    "\n",
    "label_times, cust_transactions = label_customer(CUSTOMER_ID, customer_transactions, \n",
    "                                                prediction_date = 'SMS', churn_days = 14, \n",
    "                                                lead_time = 1, prediction_window = 1, return_trans = True)\n",
    "label_times.head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are several times when we can't determine if the customer churned or not because of the way the problem has been set up. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:37:28.342597Z",
     "start_time": "2018-10-31T16:37:28.332297Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>transaction_date</th>\n",
       "      <th>membership_expire_date</th>\n",
       "      <th>is_cancel</th>\n",
       "      <th>next_transaction_date</th>\n",
       "      <th>difference_days</th>\n",
       "      <th>churn</th>\n",
       "      <th>churn_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2393</th>\n",
       "      <td>2015-11-29</td>\n",
       "      <td>2015-12-31</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-01-31</td>\n",
       "      <td>31.0</td>\n",
       "      <td>True</td>\n",
       "      <td>2016-01-15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15856</th>\n",
       "      <td>2016-01-31</td>\n",
       "      <td>2016-02-29</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-03-31</td>\n",
       "      <td>31.0</td>\n",
       "      <td>True</td>\n",
       "      <td>2016-03-15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16033</th>\n",
       "      <td>2016-03-31</td>\n",
       "      <td>2016-04-30</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-30</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7280</th>\n",
       "      <td>2016-04-30</td>\n",
       "      <td>2016-05-31</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-05-31</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16884</th>\n",
       "      <td>2016-05-31</td>\n",
       "      <td>2016-06-30</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-06-30</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6370</th>\n",
       "      <td>2016-06-30</td>\n",
       "      <td>2016-07-31</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-07-31</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20193</th>\n",
       "      <td>2016-07-31</td>\n",
       "      <td>2016-08-31</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-08-31</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18343</th>\n",
       "      <td>2016-08-31</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6365</th>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>2016-10-31</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-10-31</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11484</th>\n",
       "      <td>2016-10-31</td>\n",
       "      <td>2016-11-30</td>\n",
       "      <td>0</td>\n",
       "      <td>2016-11-30</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      transaction_date membership_expire_date  is_cancel  \\\n",
       "2393        2015-11-29             2015-12-31          0   \n",
       "15856       2016-01-31             2016-02-29          0   \n",
       "16033       2016-03-31             2016-04-30          0   \n",
       "7280        2016-04-30             2016-05-31          0   \n",
       "16884       2016-05-31             2016-06-30          0   \n",
       "6370        2016-06-30             2016-07-31          0   \n",
       "20193       2016-07-31             2016-08-31          0   \n",
       "18343       2016-08-31             2016-09-30          0   \n",
       "6365        2016-09-30             2016-10-31          0   \n",
       "11484       2016-10-31             2016-11-30          0   \n",
       "\n",
       "      next_transaction_date  difference_days  churn churn_date  \n",
       "2393             2016-01-31             31.0   True 2016-01-15  \n",
       "15856            2016-03-31             31.0   True 2016-03-15  \n",
       "16033            2016-04-30              0.0  False        NaT  \n",
       "7280             2016-05-31              0.0  False        NaT  \n",
       "16884            2016-06-30              0.0  False        NaT  \n",
       "6370             2016-07-31              0.0  False        NaT  \n",
       "20193            2016-08-31              0.0  False        NaT  \n",
       "18343            2016-09-30              0.0  False        NaT  \n",
       "6365             2016-10-31              0.0  False        NaT  \n",
       "11484            2016-11-30              0.0  False        NaT  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cust_transactions.iloc[:10, -7:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the churn on 2016-03-15, it was assigned to the `cutoff_time` of 2016-03-01 as expected with a lead time of two weeks and a prediction window of two weeks. (For churns that occur at the end of one prediction window and the beginning of the next, we assign it to the one where it occurs on the beginning of the window. This can be quickly changed by altering the logic of the function.)\n",
    "\n",
    "The function works as designed, we can pass in different parameters and rapidly make prediction problems. We also have the number of days to the churn which means we could formulate the problem as regression instead of classification. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Churn for All Customers\n",
    "\n",
    "Next, we take the function which works for one customer and apply it to all customers in a dataset. This requires a loop through the customers by grouping the customer transactions and applying `label_customer` to each customer's transactions. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:37:29.596542Z",
     "start_time": "2018-10-31T16:37:29.592918Z"
    }
   },
   "outputs": [],
   "source": [
    "def make_label_times(transactions, prediction_date, churn_days, \n",
    "                   lead_time = 1, prediction_window = 1,):\n",
    "    \"\"\"\n",
    "    Make labels for an entire series of transactions. \n",
    "    \n",
    "    Params\n",
    "    --------\n",
    "        transactions (dataframe): table of customer transactions\n",
    "        prediction_date (str): time at which predictions are made. Either \"MS\" for the first of the month\n",
    "                               or \"SMS\" for the first and fifteenth of each month \n",
    "        churn_days (int): integer number of days without an active membership required for a churn. A churn is\n",
    "                          defined by exceeding this number of days without an active membership.\n",
    "        lead_time (int): number of periods in advance to make predictions for. Defaults to 1 (preditions for one offset)\n",
    "        prediction_window(int): number of periods over which to consider churn. Defaults to 1.\n",
    "    Return\n",
    "    --------\n",
    "        label_times (dataframe): a table with customer ids, cutoff times, binary label, regression label, \n",
    "                                 and date of churn. This table can then be used for feature engineering.\n",
    "    \"\"\"\n",
    "    \n",
    "    label_times = []\n",
    "    transactions = transactions.sort_values(['msno', 'transaction_date'])\n",
    "    \n",
    "    # Iterate through each customer and find labels\n",
    "    for customer_id, customer_transactions in transactions.groupby('msno'):\n",
    "        lt_cust = label_customer(customer_id, customer_transactions,\n",
    "                                                   prediction_date, churn_days, \n",
    "                                                   lead_time, prediction_window)\n",
    "        \n",
    "        label_times.append(lt_cust)\n",
    "        \n",
    "    # Concatenate into a single dataframe\n",
    "    return pd.concat(label_times)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at examples of using this function for both prediction problems.\n",
    "\n",
    "## First Prediction Problem\n",
    "\n",
    "The defintion of the first prediction problem is as follows:\n",
    "\n",
    "* Monthly churn\n",
    "    * Prediction date = first of month\n",
    "    * Number of days to churn = 31\n",
    "    * Lead time = 1 month\n",
    "    * Prediction window = 1 month"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:38:09.330501Z",
     "start_time": "2018-10-31T16:37:30.500502Z"
    }
   },
   "outputs": [],
   "source": [
    "label_times = make_label_times(trans, prediction_date = 'MS', churn_days = 31,\n",
    "                               lead_time = 1, prediction_window = 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:38:09.536296Z",
     "start_time": "2018-10-31T16:38:09.526992Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>msno</th>\n",
       "      <th>cutoff_time</th>\n",
       "      <th>label</th>\n",
       "      <th>days_to_churn</th>\n",
       "      <th>churn_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2016-07-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2016-08-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2016-09-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2016-10-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2016-11-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2016-12-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2017-01-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2017-02-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2017-03-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2017-04-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            msno cutoff_time  label  \\\n",
       "18  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2016-07-01    0.0   \n",
       "19  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2016-08-01    0.0   \n",
       "20  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2016-09-01    0.0   \n",
       "21  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2016-10-01    0.0   \n",
       "22  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2016-11-01    0.0   \n",
       "23  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2016-12-01    0.0   \n",
       "24  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2017-01-01    0.0   \n",
       "25  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2017-02-01    0.0   \n",
       "26  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2017-03-01    NaN   \n",
       "27  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2017-04-01    NaN   \n",
       "\n",
       "    days_to_churn churn_date  \n",
       "18            NaN        NaT  \n",
       "19            NaN        NaT  \n",
       "20            NaN        NaT  \n",
       "21            NaN        NaT  \n",
       "22            NaN        NaT  \n",
       "23            NaN        NaT  \n",
       "24            NaN        NaT  \n",
       "25            NaN        NaT  \n",
       "26            NaN        NaT  \n",
       "27            NaN        NaT  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "label_times.tail(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:38:39.290256Z",
     "start_time": "2018-10-31T16:38:39.287200Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(28508, 5)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "label_times.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:38:40.844874Z",
     "start_time": "2018-10-31T16:38:40.840452Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0    21749\n",
       "1.0      526\n",
       "Name: label, dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "label_times['label'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:38:43.586867Z",
     "start_time": "2018-10-31T16:38:43.309921Z"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAegAAAE8CAYAAAAL/yI1AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvDW2N/gAAIABJREFUeJzt3X1cjff/B/DXUViUjlGnQoWFjOSmThgVhtyUhJgZmYWYuwnb13fYTGEz5nYbchujuSkjm25UpIgck1WbLyqpFqV8K0q/P3zP9evolKI6l/V6Ph49Ntf1Odd5X9e5eZ3r8/lc50hycnJKQURERKLSQNMFEBERUXkMaCIiIhFiQBMREYkQA5qIiEiEGNBEREQixIAmIiISIQZ0Ffj4+EAqlWL//v21dh+3b9+GVCrFzJkza+0+ypo5cyakUilu375dJ/f3vP3796s9pl27doVUKtVITUqRkZGQSqXw8fHRaB017WUec+WxqKvn5eviZd4TNP2a0zTlMYuMjFRZLpVKMXz48Fq739f5uP9jAloqlWr8jb0uKQO97J9MJkP79u3h6OiIefPmISwsDE+fPq2V+x8+fPhr+6Rn6Py/uj4Wyg9gUqkUZ8+erbDdnDlzhHbLly+vk9rKqosP5TVN+Zos+9eqVSv06dMHX375JXJycjRdYq1QBvDzwf9PoK3pAujVNGvWTHhzLSkpQU5ODm7cuIF9+/Zh165d6NWrF3744Qe0a9dO5XbLli3D/PnzYWJioomyMWLECNjY2EAmk2nk/ivTs2dPxMbGokWLFpoupUZp+jEvS1tbG3v27IG9vX25dfn5+Thy5Ai0tbVRXFysgepebxMmTICpqSlKS0uRkZGBU6dO4ZtvvsGxY8cQEhIiuhOZ2NhY6Ojo1Nr2xfS8ry4G9GtOX18fn376abnl9+7dw8KFC3HixAk4OzsjPDwcLVu2FNYbGRnByMioLktVoa+vD319fY3df2WaNGmCDh06aLqMGqfpx7ysIUOG4MSJE7h//z7efPNNlXU///wz8vPzMWLECJw4cUJDFb6+3nvvPfTr10/498qVKzFo0CAkJibihx9+wKJFizRYXXm1/VoT0/O+uv4xXdzVERERgblz50Iul6NNmzYwMjKCnZ0dVq1ahYKCgkpve+rUKbz77rswMTGBubk5pkyZgv/85z9q2xYWFmLjxo2wt7dHq1atYGJiAgcHB+zcuROlpbX7DatGRkbYvXs3+vbti9TUVHzzzTcq6ysalzl16hRcXFzQqVMnGBoaomPHjhgyZIjK7aVSKc6dOwcA6Natm9Cd1rVrV6GNsrvt1q1b2Lp1K3r37g2ZTIb33nsPQMVj0EpFRUX44osv0LVrVxgaGqJ79+5Ys2YNHj9+rNLuRWP3z++nj48PRo4cCQA4cOCASnegspbKxqBv3boFLy8vdO7cGQYGBrCwsMCUKVPw+++/l2ur3EcfHx8oFAqMGzcOpqamMDY2xrBhwxATE6O25ufl5+fDwMAAAwcOVFleXFwMU1NTSKVSfP/99yrrAgMDIZVKVR63lzkWZd2+fRtTp05Fu3btIJPJ4ODggODg4Crtw/MmT56MoqIiHDhwoNy63bt3w9jYGIMHD67w9pmZmVi0aBG6desGQ0NDtG3bFu7u7sLzsqyy3fhV2Yfhw4dj9erVAIBZs2apHBd1QzoREREYPnw4WrdujTZt2mDcuHFITEx84TFISkqCVCrFiBEjKmwzaNAgNG/eHH/99dcLt1cRPT094XUXFxcnLC/bNXzw4EE4OjrCxMQE77zzjsrtIyIiMH78eLRv3x4GBgbo0qULPvnkE2RkZKi9v/j4eLi5uQnHw8XFBbGxsRXWV9EYdElJCXbt2gUnJyeYmprCyMgI3bp1w4wZM5CQkADg2ZCJ8jk0cuRIlcfq+f1U99gFBgZixIgRMDU1hUwmg62tLb766ivk5+eXa1t2WM/Pzw99+vSBTCaDhYUF5s6di9zc3HK3+f333zFt2jRYWVlBJpOhXbt26NOnDz755BO17Z9XL8+gN2zYgKSkJMjlcgwePBiFhYWIiYnBmjVrEBkZiaCgIGhrlz80QUFBOHPmDEaOHIl+/fpBoVDg2LFjiIyMxK+//or27dsLbfPy8jBq1CjExcXByspKeIGEhIRgwYIFuHjxIrZu3Vqr+6mlpYVFixbBxcUFAQEBWLVqFSQSSYXtd+3ahXnz5sHQ0BBDhgyBgYEBsrOzkZiYCD8/P3zyyScAgMWLF8Pf3x8pKSmYMWOGcCas7ox48eLFuHDhAoYMGYLBgwdDV1e3SrVPnjwZV69exciRI9GwYUP88ssvWLVqFeLj4+Hv7/8SR+OZd955B3fu3MGBAwfQpUsXlTeGsh8w1ImPj4eLiwsePnyIwYMH4+2338Z//vMfBAUFITg4GP7+/hgwYIDa23333XewsbHBBx98gNTUVAQGBsLFxQWRkZGwsLCo9H51dXXRs2dPXLx4ETk5OcKbz+XLl/Hw4UMAQHh4OKZPny7cJiIiAgDUdiG/zLFISUnBwIEDYW5uDnd3dzx48ABHjx7Fe++9h2PHjqF///6V7sPz+vXrh7Zt22Lv3r2YNWuWsPzatWu4fPkyFi5cCC0tLbW3vXPnDpycnJCWloa+ffti9OjRuHfvHo4dO4bffvsNGzduxMSJE8vdrqr7oHytnjt3DsOGDVM5Fs8/x0+fPo2TJ09i0KBB8PDwQGJiIn799VdcvnwZMTExlQ6TdOjQAf369UNkZCSSk5PLPQ+uXbuGS5cuwd7eXuW95WVUdkKwadMmnD17Fk5OTrC3t1f5ELx+/XosX74czZs3x+DBgyGTyXD9+nXs2LEDp06dwm+//YZWrVoJ7WNiYjBq1CgUFRVh5MiRaN++Pa5fv44RI0ZU6zny+PFjuLu7IywsDK1atYKbmxv09fWRmpqKkJAQWFlZoXPnzpg5cyb8/f3x+++/C137VfXVV19h7dq1aN68OUaPHg19fX2EhYVh7dq1OHXqFE6dOgU9Pb1yt1u2bBlCQ0MxdOhQODo6IjIyErt378bNmzcRFBQktPv9998xaNAgSCQSDBkyBG3btkV+fj7u3LkDf39/zJo164W9iPUyoL/55huYmZmVC6uVK1fi66+/xvHjx+Hm5lbudsHBwfjpp58wZMgQYdnGjRvx73//G97e3jhy5Iiw/LPPPkNcXByWL1+OefPmCcuLioowadIkHDhwAM7OznBycqqFPfx/dnZ20NbWRlZWFm7fvg1zc/MK2/r5+aFRo0aIioqCoaGhyrrs7Gzh/z/99FNERUUhJSUFM2fOhJmZWYXbVCgUiIiIqLSNOsnJyYiOjhbCaOnSpRg+fDhOnjyJgIAAjBkzplrbU1J2/R04cABdu3ZVOzygTmlpKWbMmIHc3Fxs2bJFeBMHnoWjq6srPD09oVAo0KRJE5Xbnj59Gps3b1YJDT8/P8yfPx/btm0r17uhTv/+/XHhwgVERUUJZ1zKSVaOjo44d+4cSkpKhFA7e/YsmjVrhu7du9fIsYiKisKSJUuwZMkSYdnYsWPh5uaGjRs3VjugJRIJPvjgA6xYsQIXLlyAnZ0dAGDPnj2QSCR4//331Z4NA8CCBQuQlpZWrp7Zs2dj0KBBWLBgARwcHFSCozr7MHHiRNy5cwfnzp3D8OHD1Ya90i+//IIjR46ofBBasWIFvv32W+zbtw9z586t9DhMmzYNkZGR8PPzw6pVq1TW+fn5AQCmTp1a6TZeJC8vT/hQ26tXr3LrlScYVlZWKsvPnTuHFStWwMbGBocPH1Y5Kz148CBmzJiBJUuWYO/evQCevUZmz56NgoIC7N69Gy4uLkL7H3/8Ed7e3lWu2dfXF2FhYRg8eDD27NmDN954Q1j35MkTPHjwAADg5eWFa9eu4ffffy/XtV+ZixcvYu3atTAxMUFISAiMjY0BAMuXL8fMmTNx8OBBfPHFF1i7dm252166dAnnzp1DmzZtADzryRo5ciQiIyMRFxeHnj17Anj2uiosLMS+ffvK9ZLk5eWhUaNGL6yzXnZxm5ubqz2TVH6SDw0NVXu7/v37q4Qz8Kz7pHXr1ggNDcXdu3cBAA8ePMCBAwdgZWWlEs4A0LhxY3z++ecAgJ9++umV9+VFGjduLIzxlQ3Zimhra6Nhw4bllr/shKk5c+ZUO5wBwNvbW+UNQUdHB0uXLgUA7Nu376VqeRUxMTH4448/0KNHD5VwBgAHBweMGDECf//9N06ePFnutnZ2duXe5N9//31oa2urdDlWRhkAZWc+nz17Fp06dcK4cePw8OFDXL58GQBw9+5dJCcno2/fvhWehVZXmzZtyr3BDhw4EK1bt67yPjzvvffeg7a2Nnbv3g0AKCgowKFDh+Dg4FDhB8m7d+/izJkzaNWqFRYsWKCy7u2338bUqVNRVFSk9rVVG/vg5uZWrpdi8uTJAFClbQ4fPhzGxsbCm7lSfn4+Dh8+DJlMVu1LkPz9/eHj44NVq1Zh7ty56NWrF5KSktCuXTt89NFH5dpPnjy5XDgDwLZt21BaWopvv/223MSy8ePHw8rKCidPnkReXh6AZ6+R5ORkyOVylXAGgA8//LDcRNWKlJSUYPv27XjjjTewbt06lXAGgIYNG5Y7gagu5YeKBQsWCOEMPPvg+MUXX0BHRwf+/v548uRJudsuWrRICGfg2Xum8vWt7jFXNwFOT08PjRs3fmGd9fIM+tGjR9i2bRtOnDiBv/76C3l5eSpdQOnp6Wpv17dv33LLtLW1IZfLkZqaCoVCARMTE8TFxaG4uBgNGjRQO46pnJmalJRUQ3tUOeW+Vda9DTw7m1i6dCnkcjlcXV3Rp08fyOXyV5pgofw0WV3qjnWfPn0gkUigUCheup6XdfXqVQCo8EzRwcEBQUFBuHr1armze2tr63LtlW8yVb30xcbGBk2aNEF4eDiAZ2F28eJFTJ48WQiI8PBw2NjYCCFeWfd2dXXt2lVt2Ldu3brS8cXKyGQyDB06FMeOHYOvry9++eUX5ObmCgGnjvJxkMvlas9AHBwcsHnzZqFdbe+Duse2devWAFClx1ZbWxsffPABVq9ejePHj8Pd3R3As4lyeXl5mD59utrhtsqUHddv0qQJzM3NMXHiROHStedV9BqNiYmBtrY2goKCVLpulR4/foySkhL89ddfsLa2Fo65utdugwYNYGdnh5s3b76w/qSkJDx8+BDW1tbCsaxplb2eDQ0N0blzZ8TFxeHPP/+EpaWlyvqqPuajR4/Gtm3bMHHiRDg7O6N///6wtbWt1qS4ehfQT548gbOzM+Li4tC5c2e4urqiZcuWwotg9erVKCoqUnvbij61GRgYAIAwHnj//n0Az8Ye4+PjK6xF3USEmlZYWCh0B5Wdxa3O7NmzYWBggJ07d2L79u3CxCMbGxt8/vnnVe4+KutlP+mqu90bb7wBPT094TjXJeV9VrQ/ysvF1E38qGicSUtLCyUlJVW6/0aNGqF3794ICQnB3bt3kZiYiKKiIjg4OMDExAQdOnTA2bNn4e3tXSsBXdk+vMq19pMnT8aJEydw6NAh/Pzzz2jZsiWGDRtWYfvaehxedh/UbVP5XlLVx3bKlCn45ptv4OfnJwS0n58fGjRoUOmHlYoEBQVV67Va0bG8f/8+iouLhQlzFVG+jykfG+X7YVXv53nKx67smW1Nq+nnkfKDX9nHvGfPnggODsY333wjPMcBwNTUFPPmzavS0EW9C+iTJ08iLi4O7733HrZs2aKy7t69e5U+GTMzM9Uuz8rKAvDsmuSy//X09MSaNWtqouyXFh0djeLiYshksipNoHB3d4e7uzsePnyIS5cu4dSpU9i9ezfGjh2LqKgovPXWW9W6/xedtVckMzNTpRsJePZhIy8vD82bNxeWNWjwbJSmojfDqsyUrArlY1rRc0A5o1XZrjbY29sjJCQE4eHhSExMhJaWljDj1t7eHnv27MF///tfREREQCaTlfvkL0bKLub169cjLS0Nc+bMqXRsTgyPQ01TzuoPDAzEjRs3UFhYiPj4eAwZMqTca6A2VPQabdasGZ48eYKUlJQqbUd5zJXvh8+r6DF7njIAK+rJrAlln0fqehVq6nlkY2ODgwcP4vHjx1AoFAgLC8OPP/6IBQsWQEdHBxMmTKj09vVuDFrZxaK8vKSsiialVLa+uLhYuFxGOY7Tq1cvNGjQANHR0a9a7ispKSkRJjmMHTu2Wrdt1qwZBgwYgLVr12L27NkoLCzEb7/9JqxXfmKsrW8qU3esz58/j9LSUpXxMuWLKzU1tVz74uJitd3h6j7tvki3bt0AoMJvK1Ketarr/qopyu64s2fP4uzZs+jRo4fwBmJvb4+ioiLs3r0bd+/erfKkrZc5FjWpQYMGmDRpEtLS0gAAH3zwQaXtlY99TExMuUvugJp7HOr6uHz44YcAnp05KyeHeXh41Ml9V8TGxgZ5eXm4du1aldorXyPqXrtPnz7FhQsXqrSdDh06QF9fH3/88Yfa1/XzXua9qLLXc1ZWFm7cuIGmTZu+8AqLqmrUqBF69eoFb29vbNu2DQCqdI1/vQto5VlkVFSUyvJbt25h2bJlld42IiICp0+fVlm2detWpKamCtcQAs+6kt3d3XHt2jX4+Pio/TaktLS0Wh2DvnfvHqZMmYLz58/D1NS03IQadc6ePav2cgzlp8mys5OVE8+q+um6utauXasynlNQUICVK1cCgMqEKz09PXTs2BExMTG4fv26sLy0tBS+vr5qX+DK2qvy4leSy+Xo2LEj4uLiyk1AOnv2LIKCgtCiRYtKu2dflZWVFZo3b44zZ85AoVCodGG/88470NLSwrp16wBUPFb+vJc5FjXN09MT+/btw88///zCHppWrVph4MCBSEtLw4YNG1TW3bhxAzt37kTjxo0xbty4V6qpro+Lvb09OnTogIMHD+Lnn39G69atK70OvC4oJ83OmzdP+ABVVmFhocpJiFwuh4WFBWJiYnD8+HGVtjt27KjS+DPwLHCnTZuGwsJCLFiwoNyQY3FxscrZ+Mu8F73//vsAgHXr1qlcz11aWoply5bhv//9LyZMmKB2wmxVxcTEqP1eDXXvpxX5x3VxV/adwitXrsTQoUPRrl07bN68GQkJCbCyskJqaipOnz6NwYMHV/qCHDJkiDDgb25uDoVCgTNnzuDNN9/E119/rdJ2zZo1uHnzJlavXo2ffvpJuKg9IyMDf/75Jy5evIivvvrqlb9FJzc3V5iIVlJSgtzcXNy4cQMxMTF48uQJbGxs8OOPP5b7tiZ1Jk2ahKZNm6JXr14wNTWFRCJBXFwcoqOj0bZtW4waNUpo6+joiGPHjmHu3LlwdnaGrq4u9PX14enp+Ur7o2RhYYHevXvD2dkZ2tra+OWXX3Dr1i0MGzas3CSsuXPnwsvLC05OThg1ahSaNGmCmJgYpKWl4Z133in3YczCwgKtW7dGdHQ0PvroI7Rv3x5aWlpwcnJCly5d1NYjkUiwdetWjBo1CjNmzMDRo0eF66ADAwPRqFEjbNu2rUovupfVoEED9OvXD4GBgQBUx5ilUimsra2FWaRVHX9+mWNR05o3b17pl3U8b926dRg6dCi++uorREREwMbGRrgOurCwEOvXr3/lyUX9+/dHgwYNsG3bNjx48EAYq/T09Ky1b8CbOnWqcAnYvHnzhOEbTenfvz++/PJLLFu2DD179sS7774Lc3NzFBYWIiUlRfjwr3x9SSQSbNy4Ea6urvDw8FC5DjosLAyDBg3CmTNnqnTfixcvRlxcHH799Vf06NEDQ4cORbNmzZCWloazZ88Kr3ng2XvRd999hy+++AI3btwQetUqu6zL1tYWCxYswLp169C7d2+MGjUKzZo1Q1hYGK5evYrOnTsLV9u8rA0bNiAiIgK9e/eGmZkZ9PT08Oeff+L06dPQ0dGp0vff/+MCWt03EyktWbIELVq0QGBgIFasWIGoqChER0fD3Nwc3t7emDVrlsq1zM9zdnbG5MmTsW7dOgQHB6Nhw4ZwcXHBsmXLyl1CoKenhxMnTmDv3r04fPgwTpw4gcLCQhgYGMDMzAzLli2Dq6vrK+/vw4cPhXHzRo0aQU9PD6amppg4cSJcXFxgb29f5Rf68uXLERoaimvXriEkJATa2tpo3bo1Fi9ejOnTp6u8MU2aNAmpqakICAjAli1b8OTJE7Rp06bGAnrXrl1YvXo1AgICkJGRAWNjY3z66aeYP39+uTEz5WVPmzZtwsGDB6Grq4sBAwZg7969+Oqrr8ptW0tLC/v27cPy5ctx+vRpYRa/iYlJpaHUo0cPhIeHY+3atQgPD0dISAj09fUxfPhwfPLJJ2ovValp9vb2CAwMhI6ODmxtbcuti4uLg7m5eZW/sOFlj4UmmZmZITw8HF9//TWCg4Nx4cIFNG3aFH379sWcOXNeajLj8zp06IAffvgBGzduxL59+4QzoXHjxtVaQE+YMAH/+te/IJFIMGnSpFq5j+r6+OOPYWdnh23btiE6OhrBwcHQ1dWFsbExxo0bV+49zM7ODqdOncKXX36JkJAQhISEoGfPnjhx4gRCQkKqHNCNGjVCQEAAdu3ahYMHD+Knn35CSUkJZDIZBg4cCEdHR6Gto6MjfH19sWvXLmzfvl04437Rddeff/45rKys8MMPP+Dw4cMoKiqCmZkZFi5ciLlz56r9kpLqmDZtGpo3b464uDjhhMnY2Bjjx4/H7Nmzq3RyJsnJyand75wkIqIXio2NxeDBg+Hs7Iw9e/ZouhwSgXo3Bk1EJEbr168HgBrrhaLX3z+ui5uI6HVx/fp1nD59GgqFAidPnoSDg0O5H6ug+osBTUSkIfHx8fjiiy/QrFkzjBgxQpiFTwRwDJqIiEiUOAZNREQkQgxoIiIiEWJAExERiRADml4bycnJmi6BqM7w+U4MaCIiIhFiQBMREYkQA5qIiEiEGNBEREQixIAmIiISIQY0ERGRCDGgiYiIRIgBTUREJEL8Nat6Rl8q1XQJL62Xpgt4Bbk5OZougYheMzyDJiIiEiEGNBERkQgxoImIiESIAU1ERCRCDGgiIiIRYkATERGJEAOaiIhIhBjQREREIsSAJiIiEiEGNBERkQgxoImIiESIAU1ERCRCDGgiIiIR0lhAr1u3Do6OjmjTpg3at28Pd3d3JCQkqLQpLS2Fj48POnXqBCMjIwwfPhw3btxQaZOTkwNPT0+YmprC1NQUnp6eyHnul4OuX7+OYcOGwcjICJaWlli9ejVKS0tV2hw/fhxyuRyGhoaQy+UICgqqnR0nIiKqAo0FdFRUFD788EOcPn0agYGB0NbWxqhRo/DgwQOhzYYNG7B582asXr0aoaGhMDAwgKurK/Ly8oQ206ZNg0KhQEBAAAICAqBQKDB9+nRh/cOHD+Hq6gpDQ0OEhobC19cXGzduxKZNm4Q2sbGxmDp1KsaOHYvIyEiMHTsWU6ZMwaVLl+rmYBARET1HkpOTU/riZrUvPz8fpqam2L9/P5ycnFBaWopOnTrho48+wsKFCwEABQUFsLCwwJdffgkPDw8kJiZCLpcjODgYdnZ2AIDo6Gg4OTnh4sWLsLCwwI4dO7B8+XIkJSVBR0cHALB27Vrs3LkTCQkJkEgk8PDwwIMHD3Ds2DGhHhcXF7Rs2RI7duyo+4NRi17n34N+nfH3oKm6kpOTYWFhoekySINEMwadn5+Pp0+fQvq/ALl9+zYyMjIwYMAAoY2Ojg769OmDmJgYAM/OfHV1dSGXy4U2dnZ2aNq0qUqb3r17C+EMAAMHDkR6ejpu374NALh48aLK/SjbKLdBRERU17Q1XYDSkiVL0LVrV9ja2gIAMjIyAAAGBgYq7QwMDJCeng4AyMzMRIsWLSCRSIT1EokELVu2RGZmptDGxMSk3DaU68zNzZGRkaH2fpTbqEhycnJ1d1Pjemm6gHrqdXyukObxeVO3xNZjIYqA/uyzz3DhwgUEBwdDS0tL0+VUmdgeTBIvPleoutjFTRrv4v7000/x888/IzAwEObm5sJymUwGAMjKylJpn5WVBUNDQwCAoaEhsrOzVWZkl5aW4u+//1Zpo24bynXK+6rsfoiIiOqaRgN68eLFQjh36NBBZZ2ZmRlkMhnCwsKEZYWFhYiOjhbGnG1tbZGfn4/Y2FihTWxsLB49eqTSJjo6GoWFhUKbsLAwGBsbw8zMDABgY2Ojcj/KNmXHtomIiOqSxgJ64cKF8Pf3x48//gipVIqMjAxkZGQgPz8fwLOx5JkzZ2LDhg0IDAxEQkICvLy80LRpU4wZMwYA0LFjRwwaNAjz589HbGwsYmNjMX/+fAwZMkToGhozZgx0dHTg5eWFhIQEBAYGYv369fDy8hLGrmfMmIGIiAh8++23SEpKwrp16xAZGYmZM2dq5uAQEVG9p7HLrKQVXO6zePFifPrppwCedVf7+vpi165dyMnJQc+ePfH111+jc+fOQvucnBwsWrQIp06dAgA4OTlhzZo1Ktu/fv06Fi5ciMuXL0MqlcLDwwOLFy9WmVx2/PhxrFy5Erdu3ULbtm2xdOlSODs718auaxQvs9IMXmZF1cUxaBLNddBUNxjQmsGApupiQJPGJ4kRERFReQxoIiIiEWJAExERiRADmoiISIQY0ERERCLEgCYiIhIhBjQREZEIMaCJiIhEiAFNREQkQgxoIiIiEWJAExERiRADmoiISIQY0ERERCLEgCYiIhIhBjQREZEIMaCJiIhEiAFNREQkQgxoIiIiEWJAExERiRADmoiISIQY0ERERCLEgCYiIhIhBjQREZEIMaCJiIhEiAFNREQkQgxoIiIiEWJAExERiRADmoiISIQY0ERERCLEgCYiIhIhBjQREZEIMaCJiIhEiAFNREQkQgxoIiIiEWJAExERiRADmoiISIQY0ERERCLEgCYiIhIhBjQREZEIMaCJiIhESKMBfe7cOYwfPx6WlpaQSqXYv3+/yvqZM2dCKpWq/A0aNEilTVFREby9vdGuXTuYmJhg/PjxSEtLU2mTkpICd3d3mJiYoF27dli0aBEeP36s0iYqKgr29vaQyWTo1q0bdu7cWTs7TUREVAUaDehHjx6hc+fO8PX1hY4mWGxaAAAbhElEQVSOjto2Dg4OSExMFP4OHz6ssv7TTz9FUFAQduzYgZMnTyIvLw/u7u4oKSkBAJSUlMDd3R35+fk4efIkduzYgcDAQPzrX/8StnHr1i2MGzcOtra2iIiIwIIFC7Bo0SIcP3689naeiIioEtqavPPBgwdj8ODBAAAvLy+1bRo3bgyZTKZ2XW5uLvbu3YvNmzfD0dERAPD999+ja9euCA8Px8CBAxEaGoobN27g2rVraN26NQBgxYoVmDNnDv7973+jWbNm8PPzg5GREdauXQsA6NixIy5duoRNmzbBxcWlpnebiIjohUQ/Bh0dHY233noLPXv2xJw5c5CVlSWsi4+Px5MnTzBgwABhWevWrdGxY0fExMQAAGJjY9GxY0chnAFg4MCBKCoqQnx8vNCm7DaUba5cuYInT57U5u4RERGppdEz6BcZNGgQRo4cCTMzM9y5cwcrV66Es7MzwsPD0bhxY2RmZkJLSwstWrRQuZ2BgQEyMzMBAJmZmTAwMFBZ36JFC2hpaam0cXBwKLeN4uJiZGdnw8jISG19ycnJNbSndaeXpguop17H5wppHp83dcvCwkLTJagQdUC7ubkJ///222/D2toaXbt2xenTp+Hs7KzByp4R24NJ4sXnClVXcnIynzf1nOi7uMsyNjaGiYkJbt68CQAwNDRESUkJsrOzVdplZWXB0NBQaFO2WxwAsrOzUVJSUmmbrKwsaGtrlzs7JyIiqguvVUBnZ2cjPT1dmDRmbW2Nhg0bIiwsTGiTlpaGxMREyOVyAICtrS0SExNVLr0KCwtD48aNYW1tLbQpuw1lm+7du6Nhw4a1vVtERETlaDSg8/PzoVAooFAo8PTpU6SmpkKhUCAlJQX5+flYunQpYmNjcfv2bURGRmL8+PEwMDDAiBEjAAD6+vqYNGkSli1bhvDwcFy9ehXTp0/H22+/LYwpDxgwAJaWlpgxYwauXr2K8PBwfP755/jggw/QrFkzAICHhwfS09OxZMkSJCYmYs+ePfD398fs2bM1dWiIiKiek+Tk5JRq6s4jIyMxcuTIcssnTJiAdevWYeLEiVAoFMjNzYVMJkO/fv3wr3/9S2VGdlFREZYuXYqAgAAUFhaif//++Oabb1TapKSkYOHChYiIiMAbb7yBsWPH4ssvv0Tjxo2FNlFRUfjss8/wxx9/wMjICPPmzcPUqVNr9wBogL5UqukS6qXcnBxNl0CvGY5Bk0YDmuoeA1ozGNBUXQxoeq3GoImIiOoLBjQREZEIMaCJiIhEiAFNREQkQgxoIiIiEWJAExERiRADmoiISISqFdApKSkoKCiocH1BQQFSUlJeuSgiIqL6rloB3a1bN5w4caLC9adOnUK3bt1euSgiIqL6rloBXVpa+ZeOFRcXQyKRvFJBRERE9BJj0BUFcG5uLs6cOQMDA4NXLoqIiKi+e2FA+/r64s0338Sbb74JiUQCT09P4d9l/9q2bYvDhw/Dzc2tLuomIiL6R9N+UYOePXviww8/BABs374djo6OaN++vUobiUSCpk2bwtraGs7OzrVTKRERUT3ywoB+99138e677wIAHj16hKlTp6JXr161XhgREVF99sKALmvLli21VQcRERGVUa2ABoCSkhKEhITg1q1byMnJKTezWyKRYNGiRTVWIBERUX1UrYC+cuUKJk2ahLt371Z4yRUDmoiI6NVVK6A/+eQTFBYWYv/+/ejduzekUmlt1UVERFSvVSugr1+/jqVLl8LJyam26iEiIiJU84tKTExMaqsOIiIiKqNaAT1//nzs3r0bDx8+rK16iIiICNXs4n7w4AGaNGmCHj16wMXFBa1atYKWlpZKG4lEgjlz5tRokURERPWNJCcnp/JfwCijefPmL96gRIL79++/UlFUe/Q5sU8jcnNyNF0CvWaSk5NhYWGh6TJIg6p1Bn316tXaqoOIiIjKqFZAm5qa1lYdREREVEa1f26SiIiIal+1zqCtrKwq/D1oJYlEgvj4+FcqioiIqL6rVkD37du3XECXlJQgJSUFMTExsLS0hJWVVY0WSEREVB9VK6C3bt1a4bpr167Bzc0N48aNe+WiiIiI6rsaG4Pu2rUrpkyZgmXLltXUJomIiOqtGp0kZmhoiMTExJrcJBERUb1UYwF9//597N27l9/XTUREVAOqNQY9cuRItctzc3ORnJyMx48f4/vvv6+RwoiIiOqzagX006dPy83ilkgkMDMzg4ODA95//3106NChRgskIiKqj6oV0L/88ktt1UFERERl8JvEiIiIRKjaAf3gwQN8/vnnsLOzg4mJCUxMTGBnZ4fly5fjwYMHtVEjERFRvVOtn5tMTU3F0KFDkZaWBmtra2G8OTk5GVeuXEGrVq0QHByM1q1b11rB9Gr4c5OawZ+bpOriz01Stcagly9fjtzcXAQFBeGdd95RWXf+/HmMHz8eK1aswI8//lijRRIREdU31eriDg0NxfTp08uFMwD06dMHnp6eCAkJqbHiiIiI6qtqBXRBQQFatmxZ4fqWLVuioKCgyts7d+4cxo8fD0tLS0ilUuzfv19lfWlpKXx8fNCpUycYGRlh+PDhuHHjhkqbnJwceHp6wtTUFKampvD09ETOc92J169fx7Bhw2BkZARLS0usXr0apaWqPfvHjx+HXC6HoaEh5HI5goKCqrwfRERENa1aAd2pUyccPnwYRUVF5dY9fvwYhw4dgqWlZZW39+jRI3Tu3Bm+vr7Q0dEpt37Dhg3YvHkzVq9ejdDQUBgYGMDV1RV5eXlCm2nTpkGhUCAgIAABAQFQKBSYPn26sP7hw4dwdXWFoaEhQkND4evri40bN2LTpk1Cm9jYWEydOhVjx45FZGQkxo4diylTpuDSpUtV3hciIqKaVK1JYsePH8eUKVNgaWmJqVOn4q233gLwbDKDn58f/vjjD+zevbvCbxyrTKtWrbBmzRpMnDgRwLOz506dOuGjjz7CwoULATw7g7ewsMCXX34JDw8PJCYmQi6XIzg4GHZ2dgCA6OhoODk54eLFi7CwsMCOHTuwfPlyJCUlCR8C1q5di507dyIhIQESiQQeHh548OABjh07JtTj4uKCli1bYseOHdXeFzHjJDHN4CQxqi5OEqNqnUG7uLhg27ZtyM7Ohre3N0aPHo3Ro0dj0aJFyM7OxtatW18qnNW5ffs2MjIyMGDAAGGZjo4O+vTpg5iYGADPznx1dXUhl8uFNnZ2dmjatKlKm969e6ucoQ8cOBDp6em4ffs2AODixYsq96Nso9wGERFRXavWLG4AcHd3h5ubG65cuYKUlBQAQJs2bdC9e3doa1d7cxXKyMgAABgYGKgsNzAwQHp6OgAgMzMTLVq0UPn6UYlEgpYtWyIzM1No8/wPeCi3mZmZCXNzc2RkZKi9H+U2KpKcnPwSe6ZZvTRdQD31Oj5XSPP4vKlbYuuxeKlE1dbWho2NDWxsbGq6nteK2B5MEi8+V6i62MVNL+zivnfvHmxsbLBy5cpK261cuRK2trb4+++/a6QwmUwGAMjKylJZnpWVBUNDQwDPfn86OztbZUZ2aWkp/v77b5U26rahXKe8r8ruh4iIqK69MKC///57PHjwAHPnzq203dy5c3H//v0a+7lJMzMzyGQyhIWFCcsKCwsRHR0tjDnb2toiPz8fsbGxQpvY2Fg8evRIpU10dDQKCwuFNmFhYTA2NoaZmRkAwMbGRuV+lG3Kjm0TERHVpRcG9K+//gpXV1fo6elV2k5PTw9ubm44depUle88Pz8fCoUCCoUCT58+RWpqKhQKBVJSUiCRSDBz5kxs2LABgYGBSEhIgJeXF5o2bYoxY8YAADp27IhBgwZh/vz5iI2NRWxsLObPn48hQ4YIXUNjxoyBjo4OvLy8kJCQgMDAQKxfvx5eXl7C2PWMGTMQERGBb7/9FklJSVi3bh0iIyMxc+bMKu8LERFRTXphQP/nP/9Bly5dqrSxzp074+bNm1W+8ytXrqB///7o378/CgoK4OPjg/79+2PVqlUAnp2Vz5w5E97e3nB0dMS9e/dw5MgRlQ8L27dvR5cuXeDm5gY3Nzd06dJF5SxeX18fR48eRXp6OhwdHeHt7Y1Zs2Zh9uzZQhu5XI6dO3fC398fffv2xcGDB7Fz50706sUpVUREpBkvvA66VatWWLlyJTw8PF64MT8/PyxduhRpaWk1ViDVLF4HrRm8Dpqqi5PE6IVn0KampoiLi6vSxi5fvgxTU9NXLoqIiKi+e2FADxkyBD///DOSkpIqbZeUlISAgAAMHTq0xoojIiKqr14Y0B9//DF0dXUxcuRIBAQEoLi4WGV9cXExAgIC4OzsDD09PZWxXSIiIno5Vfou7vj4eLz//vu4e/cu3njjDbz11lvQ1dVFfn4+/vzzTxQWFsLY2Bj79++HtbV1XdRNL4lj0JrBMWiqLo5BU5W+Scza2hrnz5+Hn58fgoOD8ccffyAvLw96enqwsrKCk5MTpkyZAn19/dqul4iIqF6o1q9Z0euPZ9CawTNoqi6eQVO1fs2KiIiI6gYDmoiISIQY0ERERCLEgCYiIhIhBjQREZEIMaCJiIhEiAFNREQkQgxoIiIiEWJAExERiRADmoiISIQY0ERERCLEgCYiIhIhBjQREZEIMaCJiIhEiAFNREQkQgxoIiIiEWJAExERiRADmoiISIQY0ERERCLEgCYiIhIhBjQREZEIMaCJiIhEiAFNREQkQgxoIiIiEWJAExERiRADmoiISIQY0ERERCLEgCYiIhIhBjQREZEIMaCJiIhEiAFNREQkQgxoIiIiEWJAExERiRADmoiISIREHdA+Pj6QSqUqfx06dBDWl5aWwsfHB506dYKRkRGGDx+OGzduqGwjJycHnp6eMDU1hampKTw9PZGTk6PS5vr16xg2bBiMjIxgaWmJ1atXo7S0tE72kYiISB1RBzQAWFhYIDExUfg7f/68sG7Dhg3YvHkzVq9ejdDQUBgYGMDV1RV5eXlCm2nTpkGhUCAgIAABAQFQKBSYPn26sP7hw4dwdXWFoaEhQkND4evri40bN2LTpk11up9ERERlaWu6gBfR1taGTCYrt7y0tBRbt27FvHnz4OLiAgDYunUrLCwsEBAQAA8PDyQmJuLMmTMIDg6Gra0tAODbb7+Fk5MTkpOTYWFhgcOHD6OgoABbt26Fjo4OOnfujKSkJGzZsgWzZ8+GRCKp0/0lIiICXoMz6Fu3bqFTp06wsrLC1KlTcevWLQDA7du3kZGRgQEDBghtdXR00KdPH8TExAAAYmNjoaurC7lcLrSxs7ND06ZNVdr07t0bOjo6QpuBAwciPT0dt2/froM9JCIiKk/UZ9C9evXCli1bYGFhgb///htr167F4MGDceHCBWRkZAAADAwMVG5jYGCA9PR0AEBmZiZatGihchYskUjQsmVLZGZmCm1MTEzKbUO5ztzcvML6kpOTX3kf61ovTRdQT72OzxXSPD5v6paFhYWmS1Ah6oB+9913Vf7dq1cvWFtbw9/fHzY2Nhqq6v+J7cEk8eJzhapLOQxH9Zfou7jL0tXVRadOnXDz5k1hXDorK0ulTVZWFgwNDQEAhoaGyM7OVpmRXVpair///luljbptKNcRERFpwmsV0IWFhUhOToZMJoOZmRlkMhnCwsJU1kdHRwtjzra2tsjPz0dsbKzQJjY2Fo8ePVJpEx0djcLCQqFNWFgYjI2NYWZmVkd7RkREpErUAb106VJERUXh1q1buHTpEiZPnoz//ve/mDBhAiQSCWbOnIkNGzYgMDAQCQkJ8PLyQtOmTTFmzBgAQMeOHTFo0CDMnz8fsbGxiI2Nxfz58zFkyBCh62jMmDHQ0dGBl5cXEhISEBgYiPXr18PLy4szuImISGNEPQZ99+5dTJs2DdnZ2WjZsiV69eqF3377DaampgCAuXPnoqCgAN7e3sjJyUHPnj1x5MgR6OnpCdvYvn07Fi1aBDc3NwCAk5MT1qxZI6zX19fH0aNHsXDhQjg6OkIqlWLWrFmYPXt23e4sERFRGZKcnBx+ZVY9oi+VarqEein3uW+vI3oRThIjUXdxExER1VcMaCIiIhFiQBMREYkQA5qIiEiEGNBEREQixIAmIiISIQY0ERGRCDGgiYiIRIgBTUREJEIMaCIiIhFiQBMREYkQA5qIiEiEGNBEREQixIAmIiISIQY0ERGRCDGgiYiIRIgBTUREJEIMaCIiIhFiQBMREYkQA5qIiEiEGNBEREQixIAmIiISIQY0ERGRCDGgiYiIRIgBTUREJEIMaCIiIhFiQBMREYkQA5qIiEiEGNBEREQixIAmIiISIQY0ERGRCDGgiYiIRIgBTUREJEIMaCIiIhFiQBMREYkQA5qIiEiEGNBEREQixIAmIiISIQY0ERGRCGlrugAiotqiL5VquoSX1kvTBbyC3JwcTZfwj8Az6DK2b98OKysryGQy2Nvb4/z585ouiYiI6ikG9P8cOXIES5YswSeffIKIiAjY2tpi7NixSElJ0XRpRERUD0lycnJKNV2EGAwcOBBvv/02vvvuO2FZjx494OLigmXLlmmwspr1Onf5vc7Y5acZfL5rBp/vNYNj0AAeP36M+Ph4fPzxxyrLBwwYgJiYGA1VVTv4wqH6hM93ep2xixtAdnY2SkpKYGBgoLLcwMAAmZmZGqqKiIjqMwY0ERGRCDGgAbRo0QJaWlrIyspSWZ6VlQVDQ0MNVUVERPUZAxpAo0aNYG1tjbCwMJXlYWFhkMvlGqqKiIjqM04S+59Zs2Zh+vTp6NmzJ+RyOXbu3Il79+7Bw8ND06UREVE9xID+n9GjR+P+/ftYu3YtMjIyYGlpiUOHDsHU1FTTpRERUT3E66BJ1EpKSpCdnQ3g/+cKEBHVBzyDJlEKCgrCpk2bcOXKFRQXFwMAtLW10b17d3z88ccYMWKEhiskqnl37twRLu00NDRkD149xzNoEh0/Pz8sWrQIEyZMwMCBA4Xr07OyshAaGoqDBw9izZo1mDx5soYrJaoZmzdvxpYtW5Ceno7S0mdvyRKJBMbGxpg1axa8vLw0XCFpAgOaRKd79+6YP38+PvjgA7Xr9+zZg3Xr1iE+Pr6OKyOqeWvWrMHGjRsxd+5ctR9IN2zYgI8//hje3t4arpTqGgOaRMfIyAiRkZGwsLBQuz4pKQn9+/fHvXv36rgyopr39ttvw8fHB87OzmrXBwYGYvHixbhx40YdV0aaxuugSXQ6deqEHTt2VLjez88PnTp1qsOKiGrP/fv30bFjxwrXW1hYIIffKV4v8QyaRCcqKgru7u4wNjaGo6Oj8G1umZmZCA8PR3p6Og4dOoQ+ffpouFKiVzd8+HAYGRlh69ataNSokcq6x48fw8vLC+np6fjll180VCFpCgOaROn27dvYuXMnLl68qDKr1dbWFh4eHjAzM9NwhUQ1IyEhAa6urigoKEDv3r1VPpBGR0ejSZMmOHr0KCwtLTVcKdU1BjQRkYbl5eXh0KFDaj+QjhkzBs2aNdNwhaQJDGgiIiIR4iQxeu3MmDGjwhmvRET/FAxoeu2Ulpbi6dOnmi6DqE64uLjA2tpa02WQBvCrPum18/3332u6BKI606NHD7Rq1UrTZZAGcAyaRCktLQ07d+5ETEyMyqQZOzs7eHh48A2LiP7x2MVNohMdHQ25XI5jx47B0tISY8aMwZgxY2BpaYljx47Bzs4OFy5c0HSZRHUiNTUVs2bN0nQZpAE8gybRcXBwgK2tLdasWaN2/eLFixEbG4uwsLA6royo7l27dg329va4f/++pkuhOsYxaBKdP/74Az/++GOF6z/88EPs3r27Disiqj0HDhyodH1qamodVUJiw4Am0ZHJZLhw4UKFP5Zx4cIFyGSyOq6KqHZ4eXmhSZMmkEgkatfzioX6iwFNovPxxx9jwYIFuHz5MhwcHMp9F7e/vz98fHw0XCVRzTA2Noavr2+F1/YrFAo4ODjUbVEkCgxoEp1p06bhzTffxJYtW7B3716UlJQAALS0tGBtbY1t27bB1dVVw1US1Yxu3bpBoVBUGNASiQSlpZwqVB9xkhiJ2pMnT5CdnQ0AaNGiBRo2bKjhiohq1vnz5/Ho0SO8++67atc/evQIV65cwTvvvFPHlZGmMaCJiIhEiNdBExERiRADmoiISIQY0ESkllQqxfz582t8m5yBT1Q1DGiif5D9+/dDKpXi4sWLmi6FiF4RA5qIiEiEGNBEREQixIAmqkceP36MVatWwdHREWZmZjAyMsKAAQNw4sSJCm9z5MgRyOVyyGQy9O3bF2fOnCnXJjc3F5999hm6dOkCAwMDdOvWDWvWrBG+ZIaIqo/fJEZUj+Tl5WHXrl0YPXo0Jk6ciMLCQgQEBOD999/H4cOHy31ZRkxMDI4ePYrp06dDV1cXu3fvxvjx4xEUFITevXsDAAoKCjBy5EjcuXMHU6dOhampKeLi4uDr64uUlBRs3LhRE7tK9NpjQBPVI1KpFNeuXUPjxo2FZZ6enrC3t8emTZvKBXRCQgJ+/fVX2NraAgAmTpyIHj16YMWKFQgODgYAbNmyBcnJyTh79iw6dOgAAJgyZQrMzMywcuVKzJkzp8IfPiGiirGLm6ge0dLSEsL58ePHePDgAfLy8tCnTx/Ex8eXa9+9e3chnAHgzTffxNixY3HhwgXk5OQAAI4ePQo7Ozu0aNEC2dnZwp/yBx6ioqJqf8eI/oF4Bk1Uz+zZswdbtmxBYmKiyo8wqPu5w/bt21e47M6dO5BKpfjrr7/w+++/q20LAFlZWTVUOVH9woAmqkcOHTqEOXPmYOjQoZg7dy4MDAygra2N/fv34/Dhwy+1zadPn6J///5YsGCB2vXm5uavUDFR/cWAJqpHjh07BnNzcxw4cEDljHn//v1q2//1118VLjM1NQUAtG3bFvn5+fzNYqIaxjFoonpES0sLAFS6tm/dulXhZVZXrlxBbGys8O/79+/j8OHDkMvlkEqlAABXV1dcvnwZv/76a7nb5+XloaioqCZ3gaje4Bk00T+Qv78/wsPDyy13cnJCUFAQJkyYACcnJ9y9exc7duzAW2+9hWvXrpVr37lzZ7i7u8PT01O4zCo/Px+ff/650GbOnDkIDg7Ge++9hwkTJsDa2hoFBQW4ceMGjh8/jnPnzsHMzKw2d5foH4kBTfQP5Ofnp3Z5bGwsvvjiC+zYsQPh4eFo164dVq1ahZs3b6oNaLlcjn79+sHX1xe3bt3CW2+9hf3796Nv375CGx0dHZw4cQLffvstjh49ip9++gm6urpo3749vL29IZPJam0/if7JJDk5OaUvbkZERER1iWPQREREIsSAJiIiEiEGNBERkQgxoImIiESIAU1ERCRCDGgiIiIRYkATERGJEAOaiIhIhBjQREREIsSAJiIiEqH/A3FjpcMZpK7pAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "plt.style.use('fivethirtyeight')\n",
    "\n",
    "label_times['label'].value_counts().plot.bar(color = 'r');\n",
    "plt.xlabel('Label'); plt.ylabel('Count'); plt.title('Label Distribution with Monthly Predictions');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is an imbalanced classification problem. There are far more instances of customers not churning than of customers churning. This is not necessarily an issue as long as we are smart about the choices of metrics we use for modeling. \n",
    "\n",
    "\n",
    "## Second Prediction Problem\n",
    "\n",
    "To demonstrate how to quickly change the problem parameters, we can use the labeling function for a different prediction problem. The parameters are defined below:\n",
    "\n",
    "* Bimonthly churn\n",
    "    * Prediction date = first and fifteenth of month\n",
    "    * Number of days to churn = 14\n",
    "    * Lead time = 2 weeks\n",
    "    * Prediction window = 2 weeks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:40:01.887037Z",
     "start_time": "2018-10-31T16:39:12.674043Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>msno</th>\n",
       "      <th>cutoff_time</th>\n",
       "      <th>label</th>\n",
       "      <th>days_to_churn</th>\n",
       "      <th>churn_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>45</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2016-11-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2016-12-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>47</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2016-12-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2017-01-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2017-01-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2017-02-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>51</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2017-02-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2017-03-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>53</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2017-03-15</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>54</th>\n",
       "      <td>zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=</td>\n",
       "      <td>2017-04-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            msno cutoff_time  label  \\\n",
       "45  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2016-11-15    0.0   \n",
       "46  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2016-12-01    0.0   \n",
       "47  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2016-12-15    0.0   \n",
       "48  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2017-01-01    0.0   \n",
       "49  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2017-01-15    0.0   \n",
       "50  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2017-02-01    0.0   \n",
       "51  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2017-02-15    0.0   \n",
       "52  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2017-03-01    0.0   \n",
       "53  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2017-03-15    NaN   \n",
       "54  zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=  2017-04-01    NaN   \n",
       "\n",
       "    days_to_churn churn_date  \n",
       "45            NaN        NaT  \n",
       "46            NaN        NaT  \n",
       "47            NaN        NaT  \n",
       "48            NaN        NaT  \n",
       "49            NaN        NaT  \n",
       "50            NaN        NaT  \n",
       "51            NaN        NaT  \n",
       "52            NaN        NaT  \n",
       "53            NaN        NaT  \n",
       "54            NaN        NaT  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "label_times = make_label_times(trans, prediction_date = 'SMS', churn_days = 14,\n",
    "                               lead_time = 1, prediction_window = 1)\n",
    "label_times.tail(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:40:02.094679Z",
     "start_time": "2018-10-31T16:40:02.091142Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(54527, 5)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "label_times.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:40:02.420085Z",
     "start_time": "2018-10-31T16:40:02.297025Z"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAboAAAEdCAYAAAB67qLTAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvDW2N/gAAHhVJREFUeJzt3XmcJWV97/HPl2FTDD0QDSKLoExUuLmgImI0kStXFrNgctWYiKIhmMQlMTFXcYngFjU3ivCKGo0oiFHiEpUoSJAlbmFxAWTR9ARQBkEQmEFlE/jdP+rpcDh0T3fPnO4zU/15v17n1VVPPVX1O8+pc371PFXndKoKSZL6apNxByBJ0kIy0UmSes1EJ0nqNROdJKnXTHSSpF4z0UmSes1Et4FKskuSSvKUDWE7A9t7YZK7RrGtabZ9n1hHHfs0+9uvbX/Hhdj++khyVZLXz1JnQeNPckKSLy3Etkdlrm3Q6hy6WHGNwvBzG9XrvTG8rqNmolsgG9PB1JJXtcc9SdYkuTDJu5I8fKj6PwM7zGPbX0pywhyrXw1sD5w31+3PI467krxwqPjrbX8/HPX+RuAJwDFTM0lWJjl6FBseer0ryW1JvpvklUNV/xx49ij2OQqjbIN13P85A212Z5L/SvK2JA9cpBDmdbwmOTTJdF+U3qBe18Ww6bgD0AbjbmDqTPEXgL2AvwCOSHJwVX0VoKpuA24b9c6TbF5VdwLXjXrbM1ns/c1HVd2wwLsYfL0fABwA/H2S66vqpBbDmgWOYWP0MeCVwObAU4EPAFsDL52ucpLNqurno9jxqI7Xpfi62qMbkyR/kOS81nv6cZIvJPnlaarukuTMdtZ9RZLnDm1nu9Z7vCHJT5J8Lcmvr0tMVXVde0xW1SeBXwcuBk5Isqzt7z5Dl0m2TvLhJNcluSPJ1Une1ZadAOwPHDZwJrzfwJDk85KcmuRnwJvXMlQ5YxvMtM7g2X+Sq4BlwIen4mjl9xsKSrJvki+3fd2c5GNJfmlg+dFt24e0XtDP2pn+ipnaNcn+rQfwwDa/ZZLbk3x1oM7TW50HTcWcNnSZ5BzgkcBRA+24y8AuHtNivjXJZUkOnimWQQOv95VV9X6613rvgZjuMyoxNZ/k5UlWJflpkg8m2SzJnyT5fmuzDyTZfGC9zZK8Pck17TleluQPhtqokrwkyUntOF6V5DUDy0fWBu15/Ns05WclOX6WZruttdkP2gnBR4HfbetPHU+/keSrSW4H/qgte3ySf2ttdkOSf8nQaMlAu96a5HRg56Hl0x2vj0zyqSQ3tfUuTvKbSfYDThpo22rvx+le1yT5q/bemuqpvmJo31cleVOSY9u+fpTkmCSbDtR5SrrPn5+0x0VJDpylPRdHVflYgAdwAvCltSx/EfBbdG/exwKnAJPA5m35LkDRDVM8D3gU8Ba6M/HHtjoPAC4DPk33AbUb8DrgDuAxQ9t5ylpieSFw1wzL/k9bf+/p6gLHARcBT6R7Y/4qcERbNgF8mW6486HtsflATKvac9u1Pe4T6xzbYNrnB6wEjm7TDwHuohuyeSjw0Fa+X1t3xzb/UOAWurP2XwGeQvfh/+WB7R4N/Az4IvB4YE/gm8BX1tK+DwBuBw5s8/sDN7TXaatW9jbgawPrXAW8vk1vC1wJ/N1AOy4biP8i4CBgBfDh9hy2mevrDaRt61bg92Y6htv8LcCJwGPojt/bgdOAj7Sy36Dr8f/pwHr/D7iRbrjsl4HXAvcA+w/UKeBHwBF074mXtrL9R9EGrc6hbfpJbf+7DizfrZU9cS3tdg7wwaGy44AfDx1P321tsytdr3l34KfAG4FH0x1bnwT+E9iyrXsI3TH6l62NDm/tMXh87sf9j9cfAV+iO1Yf2bbzDLr32VQbTrXXxAyv60vba/bi1n5/0l7Xw4eOx5uBI1ud5wA/n6pDNzp4E/CutnwF8DvAr437s7iqTHQL1rCzJLpp6m/bDsont/ld2vybh+p9HTipTb+QLllsOlTnLODdQ9tZ10T36Lb+c6arC3wOOGEt2/7S8PKBmP56hvLhRLe2Npj2+TGQ6Nr8XcALh+oMf3C8ubXn5gN19mx1fr3NH9229ZCBOr9H9yG55Vra4Rzgb9v0W4Hj6U5SDmpl5w0+TwYS3XTPZyj+3x0o266VHTjL6110H74/pfvAugd4+9qO4TZ//VD7fAH4MbDF0DHxqTb9QLqE/pKhbX8GOGtgvoDjhupcDrxtFG3AQKJr8xcDbxmYfxtw0Szv0XNoiY7u5OBJdB/uJw/F8vxp2vHkobIt6E4sntnmvwr801Cdv2Ptie7NdEOZW80Q76FATVM+/LpeTTs2B8qOAa4YOh5PGapzGvDxNr1Ni22/tbXhuB4OXY5Jkr2SfCbJlUl+AvygLXr4UNX/GJr/GrBHm34C3Zna6jYk8tMkPwV+je6MaiShtr81w/L3As9Kckkb1jg4yVyPq/PnWG9tbTBKewDnVnctBICqughYM7S/H9Z9r6H9kK6dfomZnQ08rU0/DThzqizJ1nS9w7PWMe4LB+L9EV2Pd7tZ1rmb7jrsXnQjCn8EvDzJn86y3uWD7UP3Qfu9qrpjqGyqLXaj6118eWg7/879X8MLh+Z/yOzP437rzrEN3g+8KMmyNvz2QuAf57Cfw9p77HbgK3Sv48uG6gwf108AfmfoPXojsCX3vk93pzuBG/RV1u7xwNer6mdziHta7djbkelfn11y3xttZnx9qupm4IPA6UlOS3Jkkketa1yj5s0oY9AOnn+jO5BfRDf8AHAp3YfCXG1Cd9b7O9Msu3V9Yhww9WF0xXQLq+r0JDsDB9KdcX4U+E6S/avq7lm2vc5v0AH3tL8ZKt9sBNueyZ1D81MnAWtL8GcBb2htNZXU7gBeQ/eB+XPu/0G3rvHMFgsAVbVyYPaSJPvQDX2/by2rDd9YUTOUrctJ9HTtOtftzLcNTgLeQTfUugndMPtH57Cfz9ANvd5Jd8Iz3Vdtho/rTdr+3j5N3RvnsM8NxVpfn6o6IsmxdDc2PZ3uuvvLqrv+O1b26MbjMXTXjV5XVedU1eV0Xf/hD2uAfYfmf5VuyAvgG8AjgFuqauXQY71vmW9nun9JN2T07ZnqVdVNVfXxqvpjug+Op9KdoUL35li2nqGsrQ2melYPG4j7l7j/VyDmEselwL5DN1LsSfcheMk8Yx52Hl0v4A3AZFVdR9ej25PuZoavD/WKho2iHWdzN931xFFaSZfQh2+Qeirzb9ORtUFV3QKcTHdN8Ajgk1W1eg6rTr3XfjBDkpvON4D/CfzXNO/Tm1udy+iO60FPnmW73wR+NclWMyy/EyDtRrLptHZYxfSvz5VVNa8T5qq6pKreVVUH0w3Pv3g+6y8Ue3QL60FJ9hoqux34Pt2b/+VJ3kl3nentTD88eHiS79K9WQ6luy7w8rbsn+i+AvCFJK+ju7i9Hd3Q2OVV9dn5BJvkoW1y8OsFvwIcXFX3zLDOW+necJfS9a6eR3fdZ2oo9krgfyV5JN0Q4Lrc2jxjG1TVbUm+Bryq1dmU7hrYcNKYiuM04M6q+vE0+/l7uhtWTkjyN8ByuqHZr1TVV9Yh7v9WVXe2OA8D/qGV3ZTkkvacjp5lE1cCT249wlvprg2tl4HXewu6m4meD3xifbc7qKpuTXIc3dn9DXQ3jTyL7qaJp89zc6Nug/dz77D4U9dzW2vzN3TDmR9tPZ4b6N7zzwSOraorgHcCn0xyPnAq3c0lz59lu+8F/hj4XJKj6IYS9wDurqrT6NoL4LfT3eF7W1X9dJrtvA14Z5JJuuuQTwP+lBm+MjGdJLvRnTD8K901v4fRXUL51ly3sZDs0S2sJ9L1hAYfn20fsofSvdEvpbvo/FfcOww36Ei6s6KL6Q78Q6vqWwBVdTvdG/QbdHea/SfwL8A+dMl0PpYB19K9Wb4JvB44F/gf1b5DN4PbgTe1dabOXA+ue7+r8066mxUuonuDz3aWOp0Z26D5Q7rk+nW6s/QPtOcy6JV0Q4ZXcW8v8D7atZ0D6K5ZXAB8nq7X8ax1iHk6Z9Ml4sFrcWdNUzado+gS7/fo4t957dVnNfV6X0t33LyN7oPzz9Zzu9N5Hd31r3fTteehdK/hmfPczkjboKouAL5Dd43xa+uzrVn2czldb+1BwOl0vbd/pOs9r251PkN3jL6K7jh/HvDqWbZ7LV1C/AldcryU7iQvbfkFwLF0Cf16uhO56byPbqThtS22VwNHVtVsX7UY9DO6640n0x1Pn6Z7Pw5fvxyLtDtmJGlJSbIZ3YnP31bVsWMORwvIoUtJS0q7K/jBdMN+W9GNhqjHTHSSlpqd6a5fXQv8YbshQz3m0KUkqde8GUWS1GtLZuhyzZo1dl0lqecmJibu931ke3SSpF4z0UmSes1EpwU3OTk57hCkRePxvuEx0UmSes1EJ0nqNROdJKnXTHSSpF4z0UmSes1EJ0nqNROdJKnXlsxPgG3sJpYvH3cI62zvcQewHtasXj3uECStJ3t0kqReM9FJknrNRCdJ6jUTnSSp10x0kqReM9FJknrNRCdJ6jUTnSSp10x0kqReM9FJknrNRCdJ6jUTnSSp10x0kqReM9FJknrNRCdJ6jUTnSSp10x0kqReM9FJknrNRCdJ6rVFTXRJliX5dpLPt/ldk5yXZGWSf06yeSvfos2vbMt3GdjGa1r595IcOFB+UCtbmeTIxXxekqQN12L36P4cuHxg/h3AMVW1G3AzcHgrPxy4uZUf0+qRZHfgucAewEHAe1vyXAa8BzgY2B34/VZXkrTELVqiS7Ij8BvAB9t8gKcBn2pVTgSe2aYPafO05fu3+ocAJ1fVHVV1JbAS2Kc9VlbVFVV1J3ByqytJWuIWs0f3buBVwD1t/heB1VV1V5tfBezQpncArgZoy9e0+v9dPrTOTOWSpCVu08XYSZLfBK6vqm8m2W8x9rk2k5OT4w5h3vYedwBL1MZ4rGj8PG4W14oVK9a6fFESHfBk4LeTPAPYEtgaOBZYnmTT1mvbEbim1b8G2AlYlWRTYAK4caB8yuA6M5Xfz2yNIk3xWNF8TU5OetxsYBZl6LKqXlNVO1bVLnQ3k5xVVc8Dzgae1aodBnyuTZ/S5mnLz6qqauXPbXdl7gqsAM4HLgBWtLs4N2/7OGURnpokaQO3WD26mbwaODnJW4BvA8e38uOBk5KsBG6iS1xU1aVJPgFcBtwFvLSq7gZI8jLgdGAZ8KGqunRRn4kkaYOUrqPUf2vWrNmon+jE8uXjDmFJWrN69bhD0EbGocvxmpiYyHCZv4wiSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeq1RUl0SbZMcn6Si5JcmuSNrXzXJOclWZnkn5Ns3sq3aPMr2/JdBrb1mlb+vSQHDpQf1MpWJjlyMZ6XJGnDt1g9ujuAp1XVnsBewEFJ9gXeARxTVbsBNwOHt/qHAze38mNaPZLsDjwX2AM4CHhvkmVJlgHvAQ4Gdgd+v9WVJC1xi5LoqvPTNrtZexTwNOBTrfxE4Jlt+pA2T1u+f5K08pOr6o6quhJYCezTHiur6oqquhM4udWVJC1xi3aNrvW8LgSuB84A/gtYXVV3tSqrgB3a9A7A1QBt+RrgFwfLh9aZqVyStMRtulg7qqq7gb2SLAc+Azx6sfY9bHJycly7Xmd7jzuAJWpjPFY0fh43i2vFihVrXb5oiW5KVa1OcjbwJGB5kk1br21H4JpW7RpgJ2BVkk2BCeDGgfIpg+vMVH4/szWKNMVjRfM1OTnpcbOBWay7Lh/SenIkeQDwdOBy4GzgWa3aYcDn2vQpbZ62/Kyqqlb+3HZX5q7ACuB84AJgRbuLc3O6G1ZOWfhnJkna0C1Wj2574MR2d+QmwCeq6vNJLgNOTvIW4NvA8a3+8cBJSVYCN9ElLqrq0iSfAC4D7gJe2oZESfIy4HRgGfChqrp0kZ6bJGkDlq6j1H9r1qzZqJ/oxPLl4w5hSVqzevW4Q9BGxqHL8ZqYmMhwmb+MIknqNROdJKnXTHSSpF4z0UmSes1EJ0nqNROdJKnXTHSSpF4z0UmSem3OiS7Js2cof9Z05ZIkbQjm06M7fobyD4wiEEmSFsKsv3WZ5BFtcpP2Q8qDP6/yCOD2hQhMkqRRmMuPOq+k+2/goftnqYOuA44ecUySJI3MrImuqjYBSPLvVfXUhQ9JkqTRmfM1OpOcJGljNOf/R9euz70V2At40OCyqtp5xHFJkjQS8/nHqx+ju0b3SuDWhQlHkqTRmk+i2wN4clXds1DBSJI0avP5Ht2XgccuVCCSJC2E+fTorgK+mOQzdF8r+G9V9YZRBiVJ0qjMJ9FtBXwe2AzYaWHCkSRptOac6KrqRQsZiCRJC2E+Xy94xEzLquqK0YQjSdJozWfocvCnwKZU+7tsZBFJkjRC8xm6vM8dmkkeChwFfGXUQUmSNCrr/I9Xq+o64BXA20YXjiRJo7W+/2H8UcADRxGIJEkLYT43o3yFe6/JQZfg9gDeNOqgJEkalfncjPLBofmfARdV1eQI45EkaaTmczPKiQsZiCRJC2HO1+iSbJbkjUmuSHJ7+/vGJJsvZICSJK2P+Qxd/i2wD/AnwPeBhwN/DWwN/MXoQ5Mkaf3NJ9E9G9izqm5s899L8i3gIkx0kqQN1Hy+XpB5lkuSNHbzSXSfBP41yYFJHpPkIOCzrVySpA3SfIYuXwW8HngP8DDgGuDjwFsWIC5JkkZi1h5dkicneUdV3VlVb6iq3arqgVW1AtgCeNzChylJ0rqZy9Dla4Evz7DsbOB1owtHkqTRmkui2wv44gzLvgQ8frYNJNkpydlJLktyaZI/b+XbJjkjyWT7u00rT5LjkqxMcnGSxw1s67BWfzLJYQPlj0/ynbbOcUm8SUaSNKdEtzUw05fCNwN+YQ7buAt4ZVXtDuwLvDTJ7sCRwJltGPTMNg9wMLCiPV4MvA+6xEj3r4GeSPedvqOmkmOrc8TAegfNIS5JUs/NJdF9FzhghmUHtOVrVVXXVtW32vRPgMuBHYBDgKmfFjsReGabPgT4SHXOBZYn2R44EDijqm6qqpuBM4CD2rKtq+rcqirgIwPbkiQtYXO56/IY4P1JlgGfrap7kmxCl0jeA/zlfHaYZBfgscB5wHZVdW1bdB2wXZveAbh6YLVVrWxt5aumKZckLXGzJrqq+lj7b+InAlsk+THwYOAO4Kiq+vhcd5bkQcCngVdU1S2Dl9GqqpLUjCuP0OTkxvcPF/YedwBL1MZ4rGj8PG4W14oVK9a6fE7fo6uqdyX5IPAk4BeBG4H/qKpb5hpIks3oktw/VdW/tOIfJdm+qq5tw4/Xt/JrgJ0GVt+xlV0D7DdUfk4r33Ga+tOarVGkKR4rmq/JyUmPmw3MnH8ZpapuqarTq+pj7e98klyA44HLq+pdA4tOAabunDwM+NxA+Qva3Zf7AmvaEOfpwAFJtmk3oRwAnN6W3ZJk37avFwxsS5K0hM3nl1HWx5OB5wPfSXJhK3st8HbgE0kOp/uPCM9py04FngGsBG4FXgRQVTcleTNwQav3pqq6qU2/BDgBeABwWntIkpa4dDcp9t+aNWs26ic6sXz5uENYktasXj3uELSRcehyvCYmJu73Her5/KizJEkbHROdJKnXTHSSpF4z0UmSes1EJ0nqNROdJKnXTHSSpF4z0UmSes1EJ0nqNROdJKnXTHSSpF4z0UmSes1EJ0nqNROdJKnXTHSSpF4z0UmSes1EJ0nqNROdJKnXTHSSpF4z0UmSes1EJ0nqNROdJKnXTHSSpF4z0UmSes1EJ0nqNROdJKnXTHSSpF4z0UmSes1EJ0nqNROdJKnXTHSSpF4z0UmSes1EJ0nqNROdJKnXTHSSpF4z0UmSes1EJ0nqtUVJdEk+lOT6JJcMlG2b5Iwkk+3vNq08SY5LsjLJxUkeN7DOYa3+ZJLDBsofn+Q7bZ3jkmQxnpckacO3WD26E4CDhsqOBM6sqhXAmW0e4GBgRXu8GHgfdIkROAp4IrAPcNRUcmx1jhhYb3hfkqQlalESXVV9GbhpqPgQ4MQ2fSLwzIHyj1TnXGB5ku2BA4EzquqmqroZOAM4qC3buqrOraoCPjKwLUnSEjfOa3TbVdW1bfo6YLs2vQNw9UC9Va1sbeWrpimXJIlNxx0AQFVVklqs/U1OTi7WrkZm73EHsERtjMeKxs/jZnGtWLFircvHmeh+lGT7qrq2DT9e38qvAXYaqLdjK7sG2G+o/JxWvuM09Wc0W6NIUzxWNF+Tk5MeNxuYcQ5dngJM3Tl5GPC5gfIXtLsv9wXWtCHO04EDkmzTbkI5ADi9Lbslyb7tbssXDGxLkrTELUqPLsnH6XpjD06yiu7uybcDn0hyOPB94Dmt+qnAM4CVwK3AiwCq6qYkbwYuaPXeVFVTN7i8hO7OzgcAp7WHJEmku1Gx/9asWbNRP9GJ5cvHHcKStGb16nGHoI2MQ5fjNTExcb/vUfvLKJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXtt03AGMUpKDgGOBZcAHq+rtYw5J0jqYWL583CGss73HHcB6WLN69bhDWBC96dElWQa8BzgY2B34/SS7jzcqSdK49alHtw+wsqquAEhyMnAIcNlYoxqRvp5pSdPxeNco9aZHB+wAXD0wv6qVSZKWsD4lOkmS7qdPQ5fXADsNzO/YygCYmJjIokckSRq7PvXoLgBWJNk1yebAc4FTxhyTJGnMetOjq6q7krwMOJ3u6wUfqqpLxxyWJGnMUlXjjkGSpAXTp6FLbWCSbJtk23HHIWlpM9FppJLsnOTkJDcA5wHnJ7m+le0y3uikhZFkuySPa4/txh2P7suhS41Ukv8A3g18qqrubmXLgGcDr6iqfccZnzRKSfYC/gGY4N67vHcEVgMvqapvjSs23ctEp5FKMllVK+a7TNoYJbkQ+OOqOm+ofF/g/VW153gi06De3HWpDcY3k7wXOJF7f6lmJ+Aw4Ntji0paGFsNJzmAqjo3yVbjCEj3Z49OI9W+w3g43e+MTv0E2yrgX4Hjq+qOccUmjVqS44BHAh/hvid2LwCurKqXjSs23ctEJ0nrIcnB3PfE7hrglKo6dXxRaZCJTosmyW9W1efHHYekpcWvF2gxPWHcAUiLJcmLxx2DOt6MopFL8mimH8o5anxRSYvOH5LfQNij00gleTVwMt2b/Pz2CPDxJEeOMzZpkd057gDU8RqdRirJfwJ7VNXPh8o3By71e3RaKpL8oKp2HncccuhSo3cP8DDg+0Pl27dlUm8kuXimRYA/BbaBMNFp1F4BnJlkknu/V7QzsBvgd4rUN9sBBwI3D5UH+Prih6PpmOg0UlX1xSS/DOzDfW9GuWDqty+lHvk88KCqunB4QZJzFj8cTcdrdJKkXvOuS0lSr5noJEm9ZqKTeibJOUn+aLHXlTZUJjppA5bkqiT/e9xxSBszE50kqddMdNJGJsk2ST6f5IYkN7fpHYeqPTLJ+UluSfK5JNsOrL9vkq8nWZ3koiT7Le4zkBaXiU7a+GwCfBh4ON2X8W8D/n6ozguAP6T7RZq7gOMAkuwAfAF4C7At8FfAp5M8ZFEil8bARCdtZKrqxqr6dFXdWlU/Ad4KPHWo2klVdUlV/Qz4a+A5SZYBhwKnVtWpVXVPVZ0BfAN4xqI+CWkR+cso0kYmyQOBY4CDgG1a8S8kWTbw6zNXD6zyfWAz4MF0vcBnJ/mtgeWbAWcvbNTS+JjopI3PK4FHAU+squuS7AV8m/v+/7OdBqZ3Bn4O/JguAZ5UVUcsVrDSuDl0KW34Nkuy5dSDrhd3G7C63WQy3T+0PTTJ7q339ybgU62391Hgt5IcmGRZ2+Z+09zMIvWGiU7a8J1Kl9imHsuBB9D10M4FvjjNOicBJwDXAVsCfwZQVVfT/ff31wI30PXw/i9+FqjH/FFnSVKveRYnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeo1E50kqddMdJKkXjPRSZJ6zUQnSeq1/w/bxSwhXQ+h9QAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "label_times['label'].value_counts().plot.bar(color = 'r');\n",
    "plt.xlabel('Label'); plt.ylabel('Count'); plt.title('Label Distribution with Bimonthly Predictions');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:40:02.622506Z",
     "start_time": "2018-10-31T16:40:02.618016Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8525"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "label_times['label'].isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are quite a few missing labels, which occur when there is no next transaction for the customer (we don't know if the last entry for the customer is a churn or not). We won't be able to use these examples when training a model although we can make predictions for them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parallelizing Labeling\n",
    "\n",
    "Now that we have a function that can make a label times table out of customer transactions, we need to label all of the customer transactions in our dataset. We already broke the data into 1000 partitions, so we can parallelize this operation using Spark with PySpark. The basic idea is to write a function that makes the label times for one partition, and then run this in parallel across all the partitions using either multiple cores on a single machine, or a cluster of machines. \n",
    "\n",
    "The function below takes in a partition number, reads the transactions data from S3, creates the label times table for both prediction problems, and writes the label times back to S3. We can run this function in parallel over multiple partitions at once since the customers are independent of one another. That is, the labels for one customer do not depend on the data for any other customer. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:40:02.832785Z",
     "start_time": "2018-10-31T16:40:02.826593Z"
    }
   },
   "outputs": [],
   "source": [
    "def partition_to_labels(partition_number, prediction_dates = ['MS', 'SMS'], churn_periods= [31, 14],\n",
    "                        lead_times = [1, 1], prediction_windows = [1, 1]):\n",
    "    \"\"\"Make labels for all customers in one partition\n",
    "    Either for one month or twice a month\n",
    "    \n",
    "    Params\n",
    "    --------\n",
    "        partition (int): number of partition\n",
    "        label_type (list of str): either 'MS' for monthly labels or\n",
    "                                  'SMS' for bimonthly labels\n",
    "        churn_periods(list of int): number of days with no active membership to be considered a churn\n",
    "        lead_times (list of int): lead times in number of periods\n",
    "        prediction_windows (list of int): prediction windows in number of periods\n",
    "    \n",
    "    Returns\n",
    "    --------\n",
    "        None: saves the label dataframes with the appropriate name to the partition directory\n",
    "    \"\"\"\n",
    "    partition_dir = BASE_DIR + 'p' + str(partition_number)\n",
    "    \n",
    "    # Read in data and filter anomalies\n",
    "    trans = pd.read_csv(f'{partition_dir}/transactions.csv',\n",
    "                        parse_dates=['transaction_date', 'membership_expire_date'], \n",
    "                        infer_datetime_format = True)\n",
    "    \n",
    "    # Deal with data inconsistencies\n",
    "    rev = trans[(trans['membership_expire_date'] < trans['transaction_date']) | \n",
    "            ((trans['is_cancel'] == 0) & (trans['membership_expire_date'] == trans['transaction_date']))]\n",
    "    rev_members = rev['msno'].unique()\n",
    "    \n",
    "    # Remove data errors\n",
    "    trans = trans.loc[~trans['msno'].isin(rev_members)]\n",
    "\n",
    "    # Create both sets of lables\n",
    "    for prediction_date, churn_days, lead_time, prediction_window in zip(prediction_dates, churn_periods, lead_times, prediction_windows):\n",
    "        \n",
    "        cutoff_list = []\n",
    "            \n",
    "        # Make label times for all customers\n",
    "        cutoff_list.append(make_label_times(trans, prediction_date = prediction_date, \n",
    "                                            churn_days = churn_days, lead_time = lead_time,\n",
    "                                            prediction_window = prediction_window))\n",
    "        # Turn into a dataframe\n",
    "        cutoff_times = pd.concat(cutoff_list)\n",
    "        cutoff_times = cutoff_times.drop_duplicates(subset = ['msno', 'cutoff_time'])\n",
    "        \n",
    "        # Encode in order to write to s3\n",
    "        bytes_to_write = cutoff_times.to_csv(None, index = False).encode()\n",
    "\n",
    "        # Write cutoff times to S3\n",
    "        with fs.open(f'{partition_dir}/{prediction_date}-{churn_days}_labels.csv', 'wb') as f:\n",
    "            f.write(bytes_to_write)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:40:40.844134Z",
     "start_time": "2018-10-31T16:40:03.035551Z"
    }
   },
   "outputs": [],
   "source": [
    "partition_to_labels(1, prediction_dates = ['MS'], churn_periods = [31], \n",
    "                    lead_times = [1], prediction_windows = [1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:40:41.136537Z",
     "start_time": "2018-10-31T16:40:41.035112Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>msno</th>\n",
       "      <th>cutoff_time</th>\n",
       "      <th>label</th>\n",
       "      <th>days_to_churn</th>\n",
       "      <th>churn_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>27232</th>\n",
       "      <td>ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=</td>\n",
       "      <td>2016-10-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27233</th>\n",
       "      <td>ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=</td>\n",
       "      <td>2016-11-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27234</th>\n",
       "      <td>ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=</td>\n",
       "      <td>2016-12-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27235</th>\n",
       "      <td>ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=</td>\n",
       "      <td>2017-01-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27236</th>\n",
       "      <td>ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=</td>\n",
       "      <td>2017-02-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27237</th>\n",
       "      <td>ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=</td>\n",
       "      <td>2017-03-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27238</th>\n",
       "      <td>ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=</td>\n",
       "      <td>2017-04-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27239</th>\n",
       "      <td>zxHN7E8uhYR9vlOKqRoFetJE03BESH+raGiZ8evjaec=</td>\n",
       "      <td>2017-02-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27240</th>\n",
       "      <td>zxHN7E8uhYR9vlOKqRoFetJE03BESH+raGiZ8evjaec=</td>\n",
       "      <td>2017-03-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27241</th>\n",
       "      <td>zxHN7E8uhYR9vlOKqRoFetJE03BESH+raGiZ8evjaec=</td>\n",
       "      <td>2017-04-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               msno cutoff_time  label  \\\n",
       "27232  ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=  2016-10-01    0.0   \n",
       "27233  ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=  2016-11-01    0.0   \n",
       "27234  ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=  2016-12-01    0.0   \n",
       "27235  ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=  2017-01-01    0.0   \n",
       "27236  ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=  2017-02-01    0.0   \n",
       "27237  ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=  2017-03-01    NaN   \n",
       "27238  ztr541XHzi0M0EPSrCSUEytFoqKIbZXNiJhszP8vuGw=  2017-04-01    NaN   \n",
       "27239  zxHN7E8uhYR9vlOKqRoFetJE03BESH+raGiZ8evjaec=  2017-02-01    0.0   \n",
       "27240  zxHN7E8uhYR9vlOKqRoFetJE03BESH+raGiZ8evjaec=  2017-03-01    NaN   \n",
       "27241  zxHN7E8uhYR9vlOKqRoFetJE03BESH+raGiZ8evjaec=  2017-04-01    NaN   \n",
       "\n",
       "       days_to_churn churn_date  \n",
       "27232            NaN        NaN  \n",
       "27233            NaN        NaN  \n",
       "27234            NaN        NaN  \n",
       "27235            NaN        NaN  \n",
       "27236            NaN        NaN  \n",
       "27237            NaN        NaN  \n",
       "27238            NaN        NaN  \n",
       "27239            NaN        NaN  \n",
       "27240            NaN        NaN  \n",
       "27241            NaN        NaN  "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "label_times = pd.read_csv('s3://customer-churn-spark/p1/MS-31_labels.csv')\n",
    "label_times.tail(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:41:28.313923Z",
     "start_time": "2018-10-31T16:40:41.333050Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>msno</th>\n",
       "      <th>cutoff_time</th>\n",
       "      <th>label</th>\n",
       "      <th>days_to_churn</th>\n",
       "      <th>churn_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>+31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=</td>\n",
       "      <td>2015-01-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>+31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=</td>\n",
       "      <td>2015-01-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>+31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=</td>\n",
       "      <td>2015-02-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>+31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=</td>\n",
       "      <td>2015-02-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>+31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=</td>\n",
       "      <td>2015-03-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>+31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=</td>\n",
       "      <td>2015-03-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>+31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=</td>\n",
       "      <td>2015-04-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>+31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=</td>\n",
       "      <td>2015-04-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>+31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=</td>\n",
       "      <td>2015-05-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>+31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=</td>\n",
       "      <td>2015-05-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           msno cutoff_time  label  \\\n",
       "0  +31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=  2015-01-01    0.0   \n",
       "1  +31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=  2015-01-15    0.0   \n",
       "2  +31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=  2015-02-01    0.0   \n",
       "3  +31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=  2015-02-15    0.0   \n",
       "4  +31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=  2015-03-01    0.0   \n",
       "5  +31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=  2015-03-15    0.0   \n",
       "6  +31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=  2015-04-01    0.0   \n",
       "7  +31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=  2015-04-15    0.0   \n",
       "8  +31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=  2015-05-01    0.0   \n",
       "9  +31Inxq3oTPJqYZz4kLCoBqxMpVuVeh3lZBS5gM/TJU=  2015-05-15    0.0   \n",
       "\n",
       "   days_to_churn churn_date  \n",
       "0            NaN        NaN  \n",
       "1            NaN        NaN  \n",
       "2            NaN        NaN  \n",
       "3            NaN        NaN  \n",
       "4            NaN        NaN  \n",
       "5            NaN        NaN  \n",
       "6            NaN        NaN  \n",
       "7            NaN        NaN  \n",
       "8            NaN        NaN  \n",
       "9            NaN        NaN  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "partition_to_labels(1, prediction_dates = ['SMS'], churn_periods = [14],\n",
    "                    lead_times = [1], prediction_windows = [1])\n",
    "label_times = pd.read_csv('s3://customer-churn-spark/p1/SMS-14_labels.csv')\n",
    "label_times.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Spark for Parallelization\n",
    "\n",
    "The below code uses Spark to parallelize the label making. This particular implementation uses a single machine although the same idea can be extended to a cluster of machines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T16:41:30.905484Z",
     "start_time": "2018-10-31T16:41:28.486308Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><b>SparkContext</b></p>\n",
       "\n",
       "            <p><a href=\"http://ip-172-31-23-133.ec2.internal:4040\">Spark UI</a></p>\n",
       "\n",
       "            <dl>\n",
       "              <dt>Version</dt>\n",
       "                <dd><code>v2.3.1</code></dd>\n",
       "              <dt>Master</dt>\n",
       "                <dd><code>spark://ip-172-31-23-133.ec2.internal:7077</code></dd>\n",
       "              <dt>AppName</dt>\n",
       "                <dd><code>labeling</code></dd>\n",
       "            </dl>\n",
       "        </div>\n",
       "        "
      ],
      "text/plain": [
       "<SparkContext master=spark://ip-172-31-23-133.ec2.internal:7077 appName=labeling>"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import findspark\n",
    "findspark.init('/usr/local/spark/')\n",
    "\n",
    "import pyspark\n",
    "\n",
    "conf = pyspark.SparkConf()\n",
    "\n",
    "# Enable logging\n",
    "conf.set('spark.eventLog.enabled', True);\n",
    "conf.set('spark.eventLog.dir', '/data/churn/tmp/');\n",
    "\n",
    "# Use all cores on a single machine\n",
    "conf.set('spark.num.executors', 1)\n",
    "conf.set('spark.executor.memory', '56g')\n",
    "conf.set('spark.executor.cores', 15)\n",
    "\n",
    "# Make sure to specify correct spark master ip\n",
    "sc = pyspark.SparkContext(master = 'spark://ip-172-31-23-133.ec2.internal:7077',\n",
    "                          appName = 'labeling', conf = conf)\n",
    "\n",
    "sc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T20:04:05.142275Z",
     "start_time": "2018-10-31T16:41:31.101028Z"
    }
   },
   "outputs": [],
   "source": [
    "from timeit import default_timer as timer\n",
    "\n",
    "# Parallelize making all labels in Spark\n",
    "start = timer()\n",
    "sc.parallelize(list(range(1000)), numSlices=1000).\\\n",
    "   map(partition_to_labels).collect()\n",
    "sc.stop()\n",
    "end = timer()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While Spark is running, you can navigate to localhost:4040 to see the details of the particular job, or to localhost:8080 to see the overview of the cluster. This is useful for diagnosing the state of a spark operation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T20:04:05.338615Z",
     "start_time": "2018-10-31T20:04:05.336262Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "40354 seconds elapsed.\n"
     ]
    }
   ],
   "source": [
    "print(f'{round(end - start)} seconds elapsed.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T20:04:05.669969Z",
     "start_time": "2018-10-31T20:04:05.532570Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>msno</th>\n",
       "      <th>cutoff_time</th>\n",
       "      <th>label</th>\n",
       "      <th>days_to_churn</th>\n",
       "      <th>churn_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>26593</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2016-07-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26594</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2016-08-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26595</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2016-09-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26596</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2016-10-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26597</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2016-11-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26598</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2016-12-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26599</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2017-01-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26600</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2017-02-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26601</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2017-03-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26602</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2017-04-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               msno cutoff_time  label  \\\n",
       "26593  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2016-07-01    0.0   \n",
       "26594  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2016-08-01    0.0   \n",
       "26595  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2016-09-01    0.0   \n",
       "26596  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2016-10-01    0.0   \n",
       "26597  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2016-11-01    0.0   \n",
       "26598  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2016-12-01    0.0   \n",
       "26599  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2017-01-01    0.0   \n",
       "26600  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2017-02-01    0.0   \n",
       "26601  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2017-03-01    NaN   \n",
       "26602  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2017-04-01    NaN   \n",
       "\n",
       "       days_to_churn churn_date  \n",
       "26593            NaN        NaN  \n",
       "26594            NaN        NaN  \n",
       "26595            NaN        NaN  \n",
       "26596            NaN        NaN  \n",
       "26597            NaN        NaN  \n",
       "26598            NaN        NaN  \n",
       "26599            NaN        NaN  \n",
       "26600            NaN        NaN  \n",
       "26601            NaN        NaN  \n",
       "26602            NaN        NaN  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "labels = pd.read_csv(f's3://customer-churn-spark/p980/MS-31_labels.csv')\n",
    "labels.tail(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-10-31T20:04:06.052876Z",
     "start_time": "2018-10-31T20:04:05.873216Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>msno</th>\n",
       "      <th>cutoff_time</th>\n",
       "      <th>label</th>\n",
       "      <th>days_to_churn</th>\n",
       "      <th>churn_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>50916</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2016-11-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50917</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2016-12-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50918</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2016-12-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50919</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2017-01-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50920</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2017-01-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50921</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2017-02-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50922</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2017-02-15</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50923</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2017-03-01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50924</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2017-03-15</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50925</th>\n",
       "      <td>zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=</td>\n",
       "      <td>2017-04-01</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               msno cutoff_time  label  \\\n",
       "50916  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2016-11-15    0.0   \n",
       "50917  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2016-12-01    0.0   \n",
       "50918  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2016-12-15    0.0   \n",
       "50919  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2017-01-01    0.0   \n",
       "50920  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2017-01-15    0.0   \n",
       "50921  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2017-02-01    0.0   \n",
       "50922  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2017-02-15    0.0   \n",
       "50923  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2017-03-01    0.0   \n",
       "50924  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2017-03-15    NaN   \n",
       "50925  zuh7jGXSTQLrwHOgh9DDt2FmmsUXszvivNIF3RLSZck=  2017-04-01    NaN   \n",
       "\n",
       "       days_to_churn churn_date  \n",
       "50916            NaN        NaN  \n",
       "50917            NaN        NaN  \n",
       "50918            NaN        NaN  \n",
       "50919            NaN        NaN  \n",
       "50920            NaN        NaN  \n",
       "50921            NaN        NaN  \n",
       "50922            NaN        NaN  \n",
       "50923            NaN        NaN  \n",
       "50924            NaN        NaN  \n",
       "50925            NaN        NaN  "
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "labels = pd.read_csv(f's3://customer-churn-spark/p980/SMS-14_labels.csv')\n",
    "labels.tail(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conclusions\n",
    "\n",
    "In this notebook, we implemented prediction engineering for the customer churn use case. After defining the business need, we translated it into a task that can be solved with machine learning and created a set of label times. We saw how to define functions with parameters so we could solve multiple prediction problems without needing to re-write the entire code. Although we only worked through two problems, there are numerous others that could be solved with the same data and approach.\n",
    "\n",
    "\n",
    "The label times contain cutoff times for a specific prediction problem along with the associated label. The label times can now be used to make features for each label by filtering the data to before the cutoff time. This ensures that any features made are valid and will automatically be taken care of in Featuretools. \n",
    "\n",
    "The general procedure for making labels is:\n",
    "\n",
    "1. Define the business requirement: predict customers who will churn during a specified period of time\n",
    "2. Translate the business requirement into a machine learning problem: given historical customer data, build a model to predict which customers will churn depending on several parameters\n",
    "3. Make labels along with cutoff times corresponding to the machine learning problem: develop functions that take in parameters so the same function can be used for multiple prediction problems.\n",
    "4. Label all past historical data: parallelize operations by partitioning data into independent subsets\n",
    "\n",
    "This approach can be extended to other problems. Although the exact syntax is specific to this use case, the overall approach is designed to be general purpose.\n",
    "\n",
    "## Next Steps\n",
    "\n",
    "With a complete set of label times, we can now make features for each label using the cutoff times to ensure our features are valid. However, instead of the painstaking and error-prone process of making features by hand, we can use automated feature engineering in [Featuretools](https://github.com/Featuretools/featuretools) to automated this process. Featuretools will build hundreds of relevant features using only a few lines of code and will automatically filter the data to ensure that all of our features are valid. The feature engineering pipeline is developed in the `Feature Engineering` notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "position": {
    "height": "501px",
    "left": "781px",
    "right": "20px",
    "top": "574px",
    "width": "800px"
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
